feat(electric-telemetry): refine coarse process_type buckets in place #4397
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4397 +/- ##
===========================================
+ Coverage 37.04% 56.68% +19.63%
===========================================
Files 217 373 +156
Lines 17094 39676 +22582
Branches 5762 10976 +5214
===========================================
+ Hits 6333 22491 +16158
- Misses 10746 17114 +6368
- Partials 15 71 +56
Claude Code Review
Summary
Iteration 8. No new commits have landed since iteration 7.
What's Working Well
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
None.
Suggestions (Nice to Have)
MFA-string values keep a leading colon (carry-over, cosmetic)
File:
Dead pids can linger in
Claude Code Review
Summary
Iteration 7. Only two commits landed since iteration 6.
What's Working Well
Issues Found
Critical (Must Fix)
None.
Important (Should Fix)
None.
Suggestions (Nice to Have)
Adds a new low-cardinality `process_subtype` attribute alongside the
existing `process_type` on all telemetry events that today carry it
(`vm.monitor.long_{gc,schedule,message_queue}`, `process.memory`,
`process.bin_memory`).
For the three coarse `process_type` buckets that previously hid most
of the signal during overload, `process_subtype` is derived as:
* `:supervisor` -> registered name, else first atom in $ancestors
* `:erlang` -> registered name, else initial_call MFA string
* `:logger_olp` -> registered name (handler id)
For every other `process_type` value, `process_subtype` is `nil`.
The existing `process_type` taxonomy is unchanged, so Honeycomb boards
and alerts that group by it continue to work; `process_subtype` adds
a finer-grained drill-down without exploding cardinality.
Refs electric-sql/alco-agent-tasks#46.
term() is correct but uninformative — in practice the type is always atom() | binary() (atoms cover :dead, :unknown, module atoms, and atom labels; binaries cover the string-label case). Helps dialyzer downstream.
…ut nil
`is_atom(nil)` is true, so the previous clause order (`nil -> nil` before the atom guard) was load-bearing: dropping that clause would silently turn `nil` into `"nil"`. Rewrite the guard to match on `is_atom(name) and not is_nil(name)` so the clause stands on its own.
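To make the guard issue concrete, here is a minimal sketch. `NameLabelSketch.label/1` is a hypothetical helper invented for illustration, not the function touched by this commit; only the guard expression itself comes from the commit message.

```elixir
defmodule NameLabelSketch do
  # Hypothetical helper (not the PR's actual function): turns a name
  # into a tag value and leaves nil untouched.

  # nil is itself an atom, so it is excluded in the guard rather than
  # by relying on an earlier `nil -> nil` clause being tried first.
  def label(name) when is_atom(name) and not is_nil(name), do: Atom.to_string(name)
  def label(name) when is_binary(name), do: name
  def label(_other), do: nil
end

IO.inspect(is_atom(nil))
IO.inspect(NameLabelSketch.label(:default))
IO.inspect(NameLabelSketch.label(nil))
```

With the guard written this way the clauses can be reordered freely, which is the property the commit is after.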
…_subtype/1
Existing tests cover proc_type/1 returning :dead for an exited process, but proc_type_and_subtype/1 and proc_subtype/1 weren't exercised for that case. Implementation relies on Process.info/2 returning nil and Access on nil cascading nils through every helper; lock that contract down so a future refactor of info/1 doesn't silently change the answer.
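The contract described above can be observed in isolation. This is a standalone sketch using only standard library calls; it does not reproduce the PR's helpers or its test, and the choice of keys passed to `Process.info/2` is an assumption.

```elixir
# Start a process that exits immediately and wait until it is really gone.
pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
after
  1_000 -> raise "process did not exit in time"
end

# Process.info/2 returns nil for a dead pid, also when given a list of keys.
info = Process.info(pid, [:registered_name, :dictionary])

# Access on nil yields nil, so every lookup built on `info[...]` cascades to nil.
registered = info[:registered_name]
ancestors = info[:dictionary][:"$ancestors"]

IO.inspect({info, registered, ancestors})
```

Any helper that only reads from `info` through Access therefore needs no explicit dead-process branch, which is why a refactor of `info/1` could change the answer without any clause visibly breaking.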
1bd5c76 to fef04de
3c9935c to d780505
proc_type already comes from a variety of places (the process label or, failing that, initial_module), so it seems natural to extend that to check other places if the result we get back is not helpful. For example, I don't think it's ever useful to have erlang as the proc_type, and we could fall back to what you have as the subtype in that case. Same for supervisor.
The strong argument for a proc_subtype would come from {"logger_olp", handler id}, if there's an unbounded number of handler ids and if it's useful to group by logger_olp.
So I can potentially see the need for proc_subtype, but personally I wouldn't use it for erlang and supervisor because it's not particularly useful to group by erlang or supervisor and there should only be a limited number of the names we'd replace it with.
Or maybe we shouldn't have subtype at all. What are the handler ids? Are they limited and readable?
Per PR review feedback: the coarse :erlang and :supervisor types are too generic to group by on their own, so refine them in-place to the registered name / named $ancestor / initial-call MFA. Logger handler processes become logger_olp:<handler_id>. Drops the separate process_subtype attribute and its tags from all affected telemetry events.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
application_telemetry.ex now matches main exactly (the process_subtype tags are gone, so the multi-line reflows were pure churn). Also drop the refinement-detail paragraph from proc_type/1's docstring.
Makes the system_monitor events consistent with application_telemetry, which already emits to_string(map.type). The tag value is now predictably a binary at every emission site rather than an atom|binary union.
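As a small illustration of why `to_string/1` gives a uniform tag type, the sketch below normalizes a mixed list. The sample values are assumptions chosen to resemble the types discussed in this PR, not values taken from the code.

```elixir
# Hypothetical tag values: coarse types arrive as atoms, refined ones as binaries.
types = [:supervisor, :erlang, "logger_olp:default", ":erlang.apply/2"]

# to_string/1 converts atoms to binaries and passes binaries through unchanged,
# so every emitted tag ends up as a binary.
tags = Enum.map(types, &to_string/1)

IO.inspect(tags)
IO.inspect(Enum.all?(tags, &is_binary/1))
```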
@robacourt good call. I've folded the subtype back into process_type. So this PR doesn't add a new metric tag; it simply refines the values for some process types to be more descriptive.
✅ Deploy Preview for electric-next ready!
Summary
Refines the existing `process_type` telemetry attribute for the three coarse buckets that hide the most signal during overload (per recent investigations into long-mailbox spikes), instead of adding a separate `process_subtype` attribute.
Following review feedback (#4397 (review)): grouping by a bare `erlang` or `supervisor` is rarely useful, and the set of more specific names we'd replace them with is limited. So rather than carry the detail in a companion attribute, we fold it directly into `process_type`:
* `process_type = "erlang"` → replaced by the registered name; else the first named `$ancestor`; else the `initial_call` MFA string (e.g. `":erlang.apply/2"`). Falls back to `erlang` if none apply.
* `process_type = "supervisor"` → replaced by the registered name; else the first named `$ancestor`; else the `initial_call` MFA string. Falls back to `supervisor` if none apply.
* `process_type = "logger_olp"` → folded together with the handler id (its registered name) into a single value `logger_olp:<handler_id>` (e.g. `logger_olp:default`, `logger_olp:otel_log_handler`). Falls back to bare `logger_olp` for an unregistered process.
All other `process_type` values are unchanged. There is no separate `process_subtype` attribute: the refinement happens in place, keeping cardinality low (registered names + MFAs only; no pids, no dynamic registry tuples) while giving the drill-down those investigations needed.
Affected events: `vm.monitor.long_gc`, `vm.monitor.long_schedule`, `vm.monitor.long_message_queue`, `process.memory`, `process.bin_memory`.
Related issues
electric-sql/alco-agent-tasks#46, electric-sql/alco-agent-tasks#45: long-mailbox / overload investigations where the coarse `process_type` buckets (`supervisor`, `erlang`, `logger_olp`) hid the specific processes responsible.
Implementation notes
* `ElectricTelemetry.Processes.proc_type/1` now refines the coarse type in a single `Process.info/2` call (which additionally fetches `:registered_name`). The previous `proc_type_and_subtype/1` and `proc_subtype/2` are removed.
* `sorted_groups/2` groups by `type` alone again.
* `:process_subtype` is removed from the `tags:` lists and the event metadata in `application_telemetry.ex` and `system_monitor.ex`.
Test plan
* `packages/electric-telemetry/test/electric/telemetry/processes_test.exs`).
* `electric-telemetry` suite passes (121/121).
* `@core/electric-telemetry`: minor.
🤖 Generated with Claude Code
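For readers who want to see the refinement order from the Summary spelled out, the following is a rough sketch under stated assumptions. `ProcTypeRefineSketch` and all of its function names are hypothetical and do not mirror the internals of `ElectricTelemetry.Processes`; in particular, it takes the coarse type as an input rather than deriving it, and the exact set of keys passed to `Process.info/2` is a guess.

```elixir
defmodule ProcTypeRefineSketch do
  # Illustrative only: module and function names are assumptions.

  # `coarse` is the already-classified coarse type; `info` is the keyword
  # list from one Process.info/2 call, or nil for a dead process.
  def refine(coarse, info) when coarse in [:erlang, :supervisor] do
    registered(info) || named_ancestor(info) || mfa_string(info) || to_string(coarse)
  end

  def refine(:logger_olp, info) do
    case registered(info) do
      nil -> "logger_olp"
      handler_id -> "logger_olp:" <> handler_id
    end
  end

  # Every other type is passed through unchanged.
  def refine(other, _info), do: to_string(other)

  # Convenience wrapper: fetch everything the refinement needs in one call.
  def for_pid(coarse, pid) do
    refine(coarse, Process.info(pid, [:registered_name, :dictionary, :initial_call]))
  end

  # Unregistered processes report [] for :registered_name.
  defp registered(info) do
    case info[:registered_name] do
      name when is_atom(name) and not is_nil(name) -> Atom.to_string(name)
      _ -> nil
    end
  end

  # First atom in $ancestors; pids are skipped to keep cardinality low.
  defp named_ancestor(info) do
    case List.keyfind(info[:dictionary] || [], :"$ancestors", 0) do
      {_, ancestors} when is_list(ancestors) ->
        case Enum.find(ancestors, &is_atom/1) do
          nil -> nil
          name -> Atom.to_string(name)
        end

      _ ->
        nil
    end
  end

  defp mfa_string(info) do
    case info[:initial_call] do
      {m, f, a} -> "#{inspect(m)}.#{f}/#{a}"
      _ -> nil
    end
  end
end

# Usage with hand-built info lists (hypothetical values):
IO.inspect(ProcTypeRefineSketch.refine(:logger_olp, registered_name: :default))
IO.inspect(ProcTypeRefineSketch.refine(:erlang, registered_name: [], initial_call: {:erlang, :apply, 2}))
```

Because each step returns either a binary or nil, the `||` chain stops at the first usable name and a dead process (where `info` is nil) simply falls through to the coarse type.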