Added new metric node_agent_collected_metrics along with new internal prom registry by vishnukumarkvs · Pull Request #309 · coroot/coroot-node-agent

vishnukumarkvs · 2026-05-17T12:29:41Z

Hi,

To see how many metrics a coroot node agent scrapes, I have added a new Prometheus metric, node_agent_collected_metrics (type Gauge), which reports the number of series it scrapes in a scrape interval.

I have also created a new metrics endpoint on a new port (default: 9093) to separate the agent's internal metrics from the eBPF metrics it scrapes. This way, Prometheus can scrape only the agent's internal metrics, such as node_agent_info and node_agent_collected_metrics, and skip the eBPF metrics (which we send to coroot via HTTP push).
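
For readers who want to picture the proposal, here is a minimal sketch of a separate internal endpoint exposing a single gauge. It uses only the Go standard library and writes the Prometheus text exposition format by hand; it is not the code from this PR, which describes a dedicated Prometheus registry. The names `collectedSeries`, `renderInternalMetrics`, and `newInternalMux`, as well as the help text, are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

const (
	metricName = "node_agent_collected_metrics"
	// Hypothetical help text, not taken from the PR.
	helpText = "Number of series collected in the last scrape interval."
)

// collectedSeries holds the latest series count; the collector would update it.
var collectedSeries atomic.Int64

// renderInternalMetrics formats the gauge in the Prometheus text exposition format.
func renderInternalMetrics(series int64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s gauge\n%s %d\n",
		metricName, helpText, metricName, metricName, series)
}

// newInternalMux builds a mux that serves only the agent's internal metrics.
func newInternalMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderInternalMetrics(collectedSeries.Load()))
	})
	return mux
}

func main() {
	collectedSeries.Store(1234)
	// Exercise the handler in-process. A real agent would instead call
	// http.ListenAndServe(":9093", newInternalMux()) on the dedicated port.
	rec := httptest.NewRecorder()
	newInternalMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Keeping the internal gauge on its own mux means a Prometheus scrape of that port never touches the eBPF-derived series, which is the separation described above.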

def · 2026-05-18T11:56:42Z

I don't think we need this as a separate metric, especially on a dedicated endpoint. Also, this metric isn't very actionable, so even if you see a high number of metrics, it's not clear what to do next.

We could instead log some metric statistics in the remote writer, for example the top 10 metric names by number of series.
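
A rough sketch of what such a statistic could look like, assuming the writer can see the metric name of every series it is about to send. The function `topMetricNames` and the `nameCount` type are hypothetical and do not come from the agent's code base.

```go
package main

import (
	"fmt"
	"sort"
)

// nameCount pairs a metric name with the number of series observed for it.
type nameCount struct {
	Name   string
	Series int
}

// topMetricNames returns up to n metric names ordered by series count
// (descending), breaking ties alphabetically so the output is deterministic.
func topMetricNames(seriesNames []string, n int) []nameCount {
	counts := map[string]int{}
	for _, name := range seriesNames {
		counts[name]++
	}
	out := make([]nameCount, 0, len(counts))
	for name, c := range counts {
		out = append(out, nameCount{Name: name, Series: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Series != out[j].Series {
			return out[i].Series > out[j].Series
		}
		return out[i].Name < out[j].Name
	})
	if n < 0 {
		n = 0
	}
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	// Hypothetical input: one entry per series in the current write batch.
	series := []string{
		"container_http_requests_duration_seconds_total_bucket",
		"container_http_requests_duration_seconds_total_bucket",
		"container_http_requests_duration_seconds_total_bucket",
		"node_agent_info",
	}
	for _, nc := range topMetricNames(series, 10) {
		fmt.Printf("%s: %d series\n", nc.Name, nc.Series)
	}
}
```

Logging this once per write interval would point directly at the metric names that dominate cardinality, which a single total cannot do.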

To understand how many metrics each machine is writing to Prometheus, you can use a query like:

count by (machine_id) ({machine_id!=""})

vishnukumarkvs · 2026-05-19T13:27:14Z

I agree. This didn't give me much info. I could see a few anomalies, but those were expected. Regarding the top 10 metrics: if we use Prometheus as the metrics store instead of ClickHouse, the Prometheus TSDB page already shows the top 10 metrics, cardinality, etc.

I am using ClickHouse as the backend, so this info is lost for me. Fetching cardinality and running debug Prometheus queries is tough. I am trying to switch to Prometheus.

BTW, I might know why the metric cardinality blows up. The metric container_http_requests_duration_seconds_total_bucket has high cardinality. The most likely reason is AWS. We have services that talk to AWS endpoints like S3, CloudWatch, etc. Whenever a request is made, the AWS backend resolves to a new IP, as their IPs are dynamic. More queries lead to more IPs, hence the blowup. I am still validating this claim and will share more info once I confirm.

def · 2026-05-19T13:33:42Z

@vishnukumarkvs As for S3, this is a known issue. The agent should collapse public IP addresses and use the FQDN as the destination label: https://github.com/coroot/coroot-node-agent/blob/main/common/net_test.go#L35

vishnukumarkvs · 2026-05-19T13:49:48Z

Ooh, OK. Thanks for pointing it out. I will debug more then. Once I shift to Prometheus, I can see the whole picture.

vishnukumarkvs · 2026-05-19T15:36:31Z

@def, from here: https://github.com/coroot/coroot-node-agent/blob/main/common/net.go#L219 , I see we do the FQDN collapse when the DNS name resolves to more than 2 public IPs.

In most cases, AWS endpoints resolve to more than 2 public IPs, but can there be cases where one resolves to a single IP? For the FQDN logs.us-east-2.amazonaws.com:443, I do see this destination, but at the same time I see a lot of individual public IPs as well.
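
To make the gate being discussed concrete, here is a sketch of the rule as described in this comment: use the FQDN as the destination label only when the DNS answer contained more than 2 public IPs. This is not the code at common/net.go; `destinationLabel`, `isPublic`, and `publicIPThreshold` are hypothetical names, and the "more than 2" threshold is taken from the reading above, not verified against the source.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strconv"
)

// publicIPThreshold mirrors the "more than 2 public IPs" reading in the thread.
// Assumption: the real gate may differ.
const publicIPThreshold = 2

// isPublic reports whether ip is a globally routable unicast address
// (an approximation that excludes private, loopback, and link-local ranges).
func isPublic(ip netip.Addr) bool {
	return ip.IsValid() && !ip.IsPrivate() && !ip.IsLoopback() &&
		!ip.IsLinkLocalUnicast() && !ip.IsUnspecified() && !ip.IsMulticast()
}

// destinationLabel returns "fqdn:port" when the DNS answer for fqdn held more
// than publicIPThreshold public addresses, and "ip:port" otherwise.
func destinationLabel(dstIP string, port uint16, fqdn string, resolved []string) string {
	public := 0
	for _, s := range resolved {
		if ip, err := netip.ParseAddr(s); err == nil && isPublic(ip) {
			public++
		}
	}
	host := dstIP
	if fqdn != "" && public > publicIPThreshold {
		host = fqdn
	}
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}

func main() {
	fqdn := "logs.us-east-2.amazonaws.com"
	many := []string{"3.146.22.10", "3.128.56.175", "3.128.56.190"}
	fmt.Println(destinationLabel("3.146.22.10", 443, fqdn, many))     // collapsed to the FQDN
	fmt.Println(destinationLabel("3.146.22.10", 443, fqdn, many[:1])) // stays a raw IP
}
```

Under this rule, a name that resolves to a single public IP would keep its raw IP as the label, which is the case the question above asks about.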

def · 2026-05-19T15:39:58Z

This gate should not be an issue in this case:

dig logs.us-east-2.amazonaws.com

; <<>> DiG 9.10.6 <<>> logs.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1635
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;logs.us-east-2.amazonaws.com.	IN	A

;; ANSWER SECTION:
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.10
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.175
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.190
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.150
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.180
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.25
logs.us-east-2.amazonaws.com. 15 IN	A	3.146.22.55
logs.us-east-2.amazonaws.com. 15 IN	A	3.128.56.172

vishnukumarkvs · 2026-05-19T15:50:53Z

Strange. Yes, then technically it should collapse to the FQDN. But I see 3.146 and 3.128 series separately as well.

vishnukumarkvs · 2026-05-19T16:09:51Z

@def, I analyzed this with AI. Is a race condition like the one below possible?

Race condition -- connection opens before DNS is seen

Look at the event loop in containers/registry.go:294-324. 
All events come through a single channel and are processed sequentially:

case ebpftracer.EventTypeConnectionOpen:              // line 294
    c.onConnectionOpen(... e.DstAddr, e.ActualDstAddr ...)
    //   ↳ calls getDomain(dst.IP()) at container.go:628
    //   ↳ looks up r.ip2fqdn[ip] at registry.go:492
    //   ↳ if not found → domain=nil → raw IPs used

// ... later in the same loop ...
case ebpftracer.EventTypeL7Request:                   // line 313
    ip2fqdn := c.onL7Request(...)                     // DNS parsed here
    for ip, domain := range ip2fqdn {
        r.ip2fqdn[ip] = domain                        // ip2fqdn populated HERE
    }

The ip2fqdn map is only populated when an EventTypeL7Request (DNS response) is processed.
But EventTypeConnectionOpen and EventTypeL7Request arrive as separate eBPF events with no guaranteed ordering. If the TCP connect event arrives first:
1. EventTypeConnectionOpen fires
2. getDomain(dst.IP()) → r.ip2fqdn[ip] → nil (DNS response not yet processed)
3. NewDestinationKey(dst, actualDst, nil) → raw IPs used, new series created
4. EventTypeL7Request (DNS) arrives later → ip2fqdn updated, but too late -- the connection already has its key locked in

And your ClickHouse data from earlier proves this is happening for the same service:
- Row 2: destination:'logs.us-east-2.amazonaws.com:443' -- DNS was seen first, collapse worked
- Row 3: destination:'3.146.22.13:443' -- connection opened before DNS, collapse failed
Both rows are from the same fluent-bit pod talking to the same CloudWatch endpoint.

def · 2026-05-19T16:19:51Z

Race conditions are possible because perf_maps are per-CPU, so this scenario is theoretically possible. However, in that case, we would see metrics with IP addresses only for a short period of time, since the ip2fqdn mapping is stored globally and persists once resolved.
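
The ordering argument can be illustrated with a small simulation. This is a toy model, not the agent's event loop: the `event` type, the two event kinds, and `processEvents` are hypothetical, and the only behaviour modelled is a global ip2fqdn map that persists once a DNS response has been processed.

```go
package main

import "fmt"

type eventType int

const (
	connectionOpen eventType = iota
	dnsResponse
)

// event is a simplified stand-in for the eBPF events discussed in the thread.
type event struct {
	Type eventType
	IP   string
	FQDN string // set only for dnsResponse events
	Port int    // set only for connectionOpen events
}

// processEvents replays events sequentially and returns the destination key
// chosen for each connectionOpen, using a global IP-to-FQDN map that is never
// cleared once an entry is learned.
func processEvents(events []event) []string {
	ip2fqdn := map[string]string{}
	var keys []string
	for _, e := range events {
		switch e.Type {
		case dnsResponse:
			ip2fqdn[e.IP] = e.FQDN
		case connectionOpen:
			host := e.IP
			if domain, ok := ip2fqdn[e.IP]; ok {
				host = domain
			}
			keys = append(keys, fmt.Sprintf("%s:%d", host, e.Port))
		}
	}
	return keys
}

func main() {
	events := []event{
		// The connect event is processed before the DNS response (the race).
		{Type: connectionOpen, IP: "3.146.22.10", Port: 443},
		{Type: dnsResponse, IP: "3.146.22.10", FQDN: "logs.us-east-2.amazonaws.com"},
		// A later connection to the same IP finds the mapping.
		{Type: connectionOpen, IP: "3.146.22.10", Port: 443},
	}
	for _, key := range processEvents(events) {
		fmt.Println(key)
	}
}
```

In this model only the first connection gets a raw-IP key; every later connection to the same address resolves to the FQDN, which matches the expectation that IP-labelled series would appear only briefly.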

Commit fe237f8: Added new prom internal registry for agent level metrics, added new metric node_agent_collected_metrics

vishnukumarkvs wants to merge 1 commit into coroot:main from vishnukumarkvs:prom-metric-for-agent-collected-metrics