
Added new metric node_agent_collected_metrics along with new internal prom registry #309

Open
vishnukumarkvs wants to merge 1 commit into coroot:main from vishnukumarkvs:prom-metric-for-agent-collected-metrics

Conversation

@vishnukumarkvs
Contributor

@vishnukumarkvs vishnukumarkvs commented May 17, 2026

Hi,

To show how many metrics a coroot node agent collects, I have added a new Prometheus metric, node_agent_collected_metrics (type Gauge), which reports the number of series the agent collects in a scrape interval.

I have also added a new metrics endpoint on a separate port (default: 9093) to separate the agent's internal metrics from the eBPF-derived metrics it collects. This way, Prometheus can scrape only the agent's internal metrics, such as node_agent_info and node_agent_collected_metrics, and skip the eBPF metrics (which we send to coroot via HTTP push).
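
For readers who have not seen the diff, here is a minimal sketch of the idea: a gauge served from its own handler, separate from the main metrics endpoint. It uses only the Go standard library and writes the Prometheus text exposition format by hand. The PR itself presumably uses a Prometheus client registry, so the names collectedSeries, renderInternalMetrics, and internalMetricsHandler, as well as the help text, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// collectedSeries holds the number of series seen in the last collection
// cycle. Hypothetical name, not taken from the PR.
var collectedSeries atomic.Int64

// renderInternalMetrics formats the gauge in the Prometheus text exposition format.
func renderInternalMetrics(series int64) string {
	return "# HELP node_agent_collected_metrics Number of series collected in the last cycle.\n" +
		"# TYPE node_agent_collected_metrics gauge\n" +
		fmt.Sprintf("node_agent_collected_metrics %d\n", series)
}

// internalMetricsHandler serves only the agent's internal metrics.
func internalMetricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderInternalMetrics(collectedSeries.Load()))
}

func main() {
	collectedSeries.Store(1234)

	// A dedicated mux keeps internal metrics apart from any other endpoint.
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", internalMetricsHandler)

	// A real agent would call http.ListenAndServe(":9093", mux) here.
	// The sketch exercises the handler in-process so that it terminates.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Keeping the rendering in a pure function makes the output easy to check without binding a port.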

@def
Member

def commented May 18, 2026

I don't think we need this as a separate metric, especially on a dedicated endpoint. Also, this metric isn't very actionable, so even if you see a high number of metrics, it's not clear what to do next.

We could instead log some metric statistics in the remote writer, for example the top 10 metric names by number of series.

To understand how many metrics each machine is writing to Prometheus, you can use a query like:

count by (machine_id) ({machine_id!=""})
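
As a rough illustration of the logging idea above (top metric names by number of series), a minimal Go sketch might look like this. The type nameCount and the function topMetricNames are hypothetical and not part of the agent; the input is assumed to be one metric name per series that the remote writer is about to send.

```go
package main

import (
	"fmt"
	"sort"
)

// nameCount pairs a metric name with its number of series (hypothetical type).
type nameCount struct {
	Name   string
	Series int
}

// topMetricNames counts series per metric name and returns the n largest,
// ordered by series count (descending) and then by name for stable output.
func topMetricNames(seriesNames []string, n int) []nameCount {
	counts := map[string]int{}
	for _, name := range seriesNames {
		counts[name]++
	}
	out := make([]nameCount, 0, len(counts))
	for name, c := range counts {
		out = append(out, nameCount{Name: name, Series: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Series != out[j].Series {
			return out[i].Series > out[j].Series
		}
		return out[i].Name < out[j].Name
	})
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	// One entry per series; the names are sample data only.
	names := []string{
		"container_http_requests_duration_seconds_total_bucket",
		"container_http_requests_duration_seconds_total_bucket",
		"container_http_requests_duration_seconds_total_bucket",
		"node_agent_info",
	}
	for _, nc := range topMetricNames(names, 10) {
		fmt.Printf("%s series=%d\n", nc.Name, nc.Series)
	}
}
```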

@vishnukumarkvs
Contributor Author

I agree. This didn't give me much information. I could see a few anomalies, but those were expected. Regarding the top 10 metrics: if we use Prometheus as the metrics store instead of ClickHouse, the Prometheus TSDB status page already shows the top 10 metrics, cardinality, etc.

I am using ClickHouse as the backend, so this information is lost for me. Fetching cardinality and running debug PromQL queries is difficult. I am trying to switch to Prometheus.

BTW, I might know why the metric cardinality blows up. The metric container_http_requests_duration_seconds_total_bucket has high cardinality. The most likely reason is AWS. We have services that talk to AWS endpoints such as S3 and CloudWatch. Whenever a request is made, the AWS backend may resolve to a new IP, as their IPs are dynamic. More queries lead to more IPs, hence the blowup. I am still validating this claim and will share more info once I confirm.

@def
Member

def commented May 19, 2026

@vishnukumarkvs As for S3, this is a known issue. The agent should collapse public IP addresses and use the FQDN as the destination label: https://github.com/coroot/coroot-node-agent/blob/main/common/net_test.go#L35

@vishnukumarkvs
Contributor Author

Oh, OK. Thanks for pointing it out. I will debug further then. Once I switch to Prometheus, I can see the whole picture.

@vishnukumarkvs
Contributor Author

@def, from here: https://github.com/coroot/coroot-node-agent/blob/main/common/net.go#L219 , I see that we collapse to the FQDN when the DNS name resolves to more than 2 public IPs.

In most cases, AWS endpoints resolve to more than 2 public IPs, but can there be cases where one resolves to a single IP? I ask because I see the FQDN logs.us-east-2.amazonaws.com:443 as a destination, but at the same time I also see a lot of individual public IPs.

@def
Member

def commented May 19, 2026

This gate should not be an issue in this case:

dig logs.us-east-2.amazonaws.com

; <<>> DiG 9.10.6 <<>> logs.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1635
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;logs.us-east-2.amazonaws.com.	IN	A

;; ANSWER SECTION:
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.10
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.175
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.190
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.150
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.180
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.25
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.55
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.172

@vishnukumarkvs
Contributor Author

Strange. Yes, then technically it should collapse to the FQDN. But I also see separate series for individual 3.146.x and 3.128.x addresses.

@vishnukumarkvs
Contributor Author

@def, I analyzed this with AI. Is a race condition like the one below possible?

Race condition -- connection opens before DNS is seen

Look at the event loop in containers/registry.go:294-324. 
All events come through a single channel and are processed sequentially:

case ebpftracer.EventTypeConnectionOpen:              // line 294
    c.onConnectionOpen(... e.DstAddr, e.ActualDstAddr ...)
    //   ↳ calls getDomain(dst.IP()) at container.go:628
    //   ↳ looks up r.ip2fqdn[ip] at registry.go:492
    //   ↳ if not found → domain=nil → raw IPs used

// ... later in the same loop ...
case ebpftracer.EventTypeL7Request:                   // line 313
    ip2fqdn := c.onL7Request(...)                     // DNS parsed here
    for ip, domain := range ip2fqdn {
        r.ip2fqdn[ip] = domain                        // ip2fqdn populated HERE
    }

The ip2fqdn map is only populated when an EventTypeL7Request (DNS response) is processed. 
But EventTypeConnectionOpen and EventTypeL7Request arrive as separate eBPF events with no guaranteed ordering. If the TCP connect event arrives first:
1. EventTypeConnectionOpen fires
2. getDomain(dst.IP()) → r.ip2fqdn[ip] → nil (DNS response not yet processed)
3. NewDestinationKey(dst, actualDst, nil) → raw IPs used, new series created
4. EventTypeL7Request (DNS) arrives later → ip2fqdn updated, but too late -- the connection already has its key locked in

And your ClickHouse data from earlier proves this is happening for the same service:
• Row 2: destination:'logs.us-east-2.amazonaws.com:443' -- DNS was seen first, collapse worked
• Row 3: destination:'3.146.22.13:443' -- connection opened before DNS, collapse failed
Both rows are from the same fluent-bit pod talking to the same CloudWatch endpoint.

@def
Member

def commented May 19, 2026

Race conditions can happen because perf_maps are per-CPU, so this scenario is theoretically possible. However, in that case we would see metrics with IP addresses only for a short period of time, since the ip2fqdn mapping is stored globally and persists once resolved.
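
The ordering argument in this thread can be reproduced with a toy model. The Go sketch below replays events sequentially against a global IP-to-FQDN map: a connection that opens before the DNS response is keyed by its raw IP, while any later connection to the same IP is keyed by the FQDN because the mapping persists. The types event and eventType and the function replay are hypothetical simplifications and not the agent's code; ports and per-container state are left out.

```go
package main

import "fmt"

type eventType int

const (
	connectionOpen eventType = iota
	dnsResponse
)

// event is a simplified stand-in for an eBPF event (hypothetical type).
type event struct {
	Type eventType
	IP   string
	FQDN string // set only for dnsResponse events
}

// replay processes events one by one and returns the destination key
// chosen for each connectionOpen event, in order.
func replay(events []event) []string {
	ip2fqdn := map[string]string{} // global mapping, kept once resolved
	var keys []string
	for _, e := range events {
		switch e.Type {
		case dnsResponse:
			ip2fqdn[e.IP] = e.FQDN
		case connectionOpen:
			if fqdn, ok := ip2fqdn[e.IP]; ok {
				keys = append(keys, fqdn) // DNS already seen: collapse to FQDN
			} else {
				keys = append(keys, e.IP) // DNS not seen yet: raw IP is used
			}
		}
	}
	return keys
}

func main() {
	events := []event{
		{Type: connectionOpen, IP: "3.146.22.13"},
		{Type: dnsResponse, IP: "3.146.22.13", FQDN: "logs.us-east-2.amazonaws.com"},
		{Type: connectionOpen, IP: "3.146.22.13"},
	}
	fmt.Println(replay(events))
	// prints: [3.146.22.13 logs.us-east-2.amazonaws.com]
}
```

In this model only the first connection ends up keyed by IP, which matches the point that IP-labelled series from such a race should be short-lived.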
