Added new metric node_agent_collected_metrics along with new internal prom registry #309
…etric node_agent_collected_metrics
|
I don't think we need this as a separate metric, especially on a dedicated endpoint. Also, this metric isn't very actionable, so even if you see a high number of metrics, it's not clear what to do next. We could instead log some metric statistics in the remote writer, for example the top 10 metric names by number of series. To understand how many metrics each machine is writing to Prometheus, you can use a query like: |
I agree. This didn't give me much information. I could see a few anomalies, but those were expected. Regarding the top 10 metrics: if we use Prometheus as the metrics store instead of ClickHouse, the Prometheus TSDB page shows the top 10 metrics, cardinality, and so on. I am using ClickHouse as the backend, so this information is lost for me. Fetching cardinality and running debug Prometheus queries is tough, so I am trying to switch to Prometheus.
BTW, I might know why the metric cardinality blows up. The metric container_http_requests_duration_seconds_total_bucket has high cardinality. The most likely reason is AWS. We have services that talk to AWS endpoints such as S3 and CloudWatch. Whenever a request is made, the AWS backend resolves to a new IP, because their IPs are dynamic. More queries lead to more IPs, hence the blowup. I am still validating this claim and will share more info once I confirm. |
|
@vishnukumarkvs As for S3, this is a known issue. The agent should collapse public IP addresses and use the FQDN as the destination label: https://github.com/coroot/coroot-node-agent/blob/main/common/net_test.go#L35 |
Ooh, OK. Thanks for pointing it out. I will debug more, then. Once I shift to Prometheus, I can view the whole picture |
|
@def, from here: https://github.com/coroot/coroot-node-agent/blob/main/common/net.go#L219 , I see we do the FQDN collapse when the DNS name resolves to more than 2 public IPs. In most cases, AWS endpoints resolve to more than 2 public IPs, but can there be cases where one resolves to a single IP? I ask because for the FQDN logs.us-east-2.amazonaws.com:443 I do see this destination, but at the same time I also see a lot of individual public IPs |
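The gate asked about here can be illustrated with a minimal sketch. It is not the code in common/net.go: the function names, the string-based signature, and the definition of "public" (anything that is not private, loopback, link-local, or unspecified) are assumptions made for illustration. Only the "more than 2 public IPs" threshold comes from the comment above.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strconv"
)

// collapseThreshold mirrors the "more than 2 public IPs" gate described in the
// comment above. The constant name is hypothetical.
const collapseThreshold = 2

// isPublic is an assumed definition of a public address for this sketch.
func isPublic(raw string) bool {
	ip, err := netip.ParseAddr(raw)
	if err != nil {
		return false // unparsable addresses never count as public
	}
	return !ip.IsPrivate() && !ip.IsLoopback() && !ip.IsLinkLocalUnicast() && !ip.IsUnspecified()
}

// destinationLabel returns "fqdn:port" when the name resolved to more than
// collapseThreshold public IPs, and "ip:port" otherwise.
func destinationLabel(fqdn string, resolvedIPs []string, dstIP string, dstPort int) string {
	public := 0
	for _, ip := range resolvedIPs {
		if isPublic(ip) {
			public++
		}
	}
	port := strconv.Itoa(dstPort)
	if fqdn != "" && public > collapseThreshold {
		return net.JoinHostPort(fqdn, port)
	}
	return net.JoinHostPort(dstIP, port)
}

func main() {
	many := []string{"3.146.22.10", "3.128.56.175", "3.128.56.190"}
	fmt.Println(destinationLabel("logs.us-east-2.amazonaws.com", many, "3.146.22.10", 443))

	// A name that resolves to a single public IP stays below the gate.
	single := []string{"3.146.22.10"}
	fmt.Println(destinationLabel("logs.us-east-2.amazonaws.com", single, "3.146.22.10", 443))
}
```

Under these assumptions, a name that resolves to one or two public IPs keeps the raw `ip:port` label, which is the case the question is about.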
|
This gate should not be an issue in this case:
dig logs.us-east-2.amazonaws.com
; <<>> DiG 9.10.6 <<>> logs.us-east-2.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1635
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;logs.us-east-2.amazonaws.com. IN A
;; ANSWER SECTION:
logs.us-east-2.amazonaws.com. 15 IN A 3.146.22.10
logs.us-east-2.amazonaws.com. 15 IN A 3.128.56.175
logs.us-east-2.amazonaws.com. 15 IN A 3.128.56.190
logs.us-east-2.amazonaws.com. 15 IN A 3.128.56.150
logs.us-east-2.amazonaws.com. 15 IN A 3.128.56.180
logs.us-east-2.amazonaws.com. 15 IN A 3.146.22.25
logs.us-east-2.amazonaws.com. 15 IN A 3.146.22.55
logs.us-east-2.amazonaws.com. 15 IN A 3.128.56.172 |
Strange. Yes, then technically it should collapse to the FQDN. But I also see the 3.146 and 3.128 series separately |
|
@def, I am analyzing this with AI. Is a race condition like the one below possible? |
|
Race conditions are possible because perf_maps are per-CPU, so this scenario is theoretically possible. However, in that case, we would see metrics with IP addresses only for a short period of time, since the ip2fqdn mapping is stored globally and persists once resolved |
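The reasoning in this reply can be sketched as follows, assuming a process-wide map guarded by a mutex. The type and method names are hypothetical, not the agent's actual code. A connection event handled before the matching DNS event is labeled with the raw IP. Once the mapping is stored, every later lookup returns the FQDN.

```go
package main

import (
	"fmt"
	"sync"
)

// ip2fqdnCache is a hypothetical stand-in for a globally stored ip2fqdn mapping.
type ip2fqdnCache struct {
	mu sync.RWMutex
	m  map[string]string
}

func newIP2FQDNCache() *ip2fqdnCache {
	return &ip2fqdnCache{m: make(map[string]string)}
}

// onDNSResponse records the FQDN for every IP in a DNS answer.
func (c *ip2fqdnCache) onDNSResponse(fqdn string, ips ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, ip := range ips {
		c.m[ip] = fqdn
	}
}

// label returns the FQDN if the IP has been resolved, otherwise the IP itself.
func (c *ip2fqdnCache) label(ip string) string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if fqdn, ok := c.m[ip]; ok {
		return fqdn
	}
	return ip
}

func main() {
	cache := newIP2FQDNCache()

	// The connection event is read before the DNS event: the label is the raw IP.
	fmt.Println(cache.label("3.146.22.10"))

	// The DNS event arrives; the mapping is stored and persists.
	cache.onDNSResponse("logs.us-east-2.amazonaws.com", "3.146.22.10", "3.128.56.175")

	// Later connection events to the same IP get the FQDN.
	fmt.Println(cache.label("3.146.22.10"))
}
```

In this model, an event-ordering race can only produce IP-labeled series for a short window, so long-lived per-IP series would point to a different cause.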
Hi,
To see how many metrics a Coroot node agent scrapes, I have added a new Prometheus metric, node_agent_collected_metrics (type Gauge), which reports the number of series the agent scrapes in a scrape interval.
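A rough sketch of what such a count could look like, assuming a simplified `series` type that is not part of the agent. The same pass can also produce the per-name breakdown suggested in the review thread (top metric names by number of series). All type and function names here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// series is a hypothetical, simplified view of one collected time series.
type series struct {
	name   string
	labels map[string]string
}

// nameCount pairs a metric name with its number of series.
type nameCount struct {
	name  string
	count int
}

// countSeries returns the total number of series and the count per metric name.
func countSeries(collected []series) (int, map[string]int) {
	perName := make(map[string]int)
	for _, s := range collected {
		perName[s.name]++
	}
	return len(collected), perName
}

// topNames returns up to n metric names ordered by descending series count,
// with ties broken alphabetically so the result is deterministic.
func topNames(perName map[string]int, n int) []nameCount {
	out := make([]nameCount, 0, len(perName))
	for name, c := range perName {
		out = append(out, nameCount{name: name, count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].count != out[j].count {
			return out[i].count > out[j].count
		}
		return out[i].name < out[j].name
	})
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	const bucket = "container_http_requests_duration_seconds_total_bucket"
	collected := []series{
		{name: bucket, labels: map[string]string{"destination": "3.146.22.10:443"}},
		{name: bucket, labels: map[string]string{"destination": "3.128.56.175:443"}},
		{name: "node_agent_info", labels: map[string]string{}},
	}
	total, perName := countSeries(collected)
	fmt.Println("collected series:", total)
	for _, nc := range topNames(perName, 10) {
		fmt.Println(nc.name, nc.count)
	}
}
```

The total would feed the gauge, while the top-N list could be logged rather than exported.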
I have also created a new metrics endpoint on a new port (default: 9093) to separate the agent's internal metrics from its eBPF-scraped metrics. This way, Prometheus can scrape only the agent's internal metrics, such as node_agent_info and node_agent_collected_metrics, and skip the eBPF metrics (which we send to Coroot via HTTP push).
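The split described above can be sketched with the standard library alone. This is not the PR's implementation, which presumably uses a Prometheus client registry. To stay dependency-free, the sketch hand-renders the gauge in the Prometheus text exposition format, and the `/metrics` path, function names, and callback are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// defaultInternalAddr uses the default port named in the description (9093).
const defaultInternalAddr = ":9093"

// renderInternalMetrics renders the gauge in the Prometheus text exposition format.
func renderInternalMetrics(collected int) string {
	return fmt.Sprintf(
		"# TYPE node_agent_collected_metrics gauge\nnode_agent_collected_metrics %d\n",
		collected,
	)
}

// newInternalMux returns a mux that exposes only agent-internal metrics.
// collected is a hypothetical callback returning the series count of the last cycle.
func newInternalMux(collected func() int) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderInternalMetrics(collected()))
	})
	return mux
}

func main() {
	mux := newInternalMux(func() int { return 1234 })

	// Exercise the handler in-process. A real agent would instead run
	// http.ListenAndServe(defaultInternalAddr, mux) alongside its main listener.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Keeping the internal metrics on their own mux means a Prometheus scrape job pointed at this port never sees the eBPF series.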