feat(electric-telemetry): refine coarse process_type buckets in place#4397

Merged: alco merged 11 commits into main from erik/process-subtype-attribute on Jun 10, 2026
@erik-the-implementer erik-the-implementer commented May 22, 2026

Contributor

Summary

Refines the existing process_type telemetry attribute for the three coarse buckets that hide the most signal during overload (per recent investigations into long-mailbox spikes), instead of adding a separate process_subtype attribute.

Following review feedback (#4397 (review)): grouping by a bare erlang or supervisor is rarely useful, and the set of more specific names we'd replace them with is limited. So rather than carry the detail in a companion attribute, we fold it directly into process_type:

  • process_type = "erlang" → replaced by the registered name; else the first named $ancestor; else the initial_call MFA string (e.g. ":erlang.apply/2"). Falls back to erlang if none apply.
  • process_type = "supervisor" → replaced by the registered name; else the first named $ancestor; else the initial_call MFA string. Falls back to supervisor if none apply.
  • process_type = "logger_olp" → folded together with the handler id (its registered name) into a single value logger_olp:<handler_id> (e.g. logger_olp:default, logger_olp:otel_log_handler). Falls back to bare logger_olp for an unregistered process.
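
The fallback chain above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RefineSketch` and its helpers are hypothetical names, and the `info` argument is assumed to be a keyword list shaped like the result of `Process.info/2` (or `nil` for a process that has already exited).

```elixir
defmodule RefineSketch do
  # Hypothetical helper, not the PR's refine_type/2.
  # :erlang and :supervisor fall through registered name, first named
  # ancestor, initial_call MFA string, and finally the coarse type itself.
  def refine(type, info) when type in [:erlang, :supervisor] do
    registered_name(info) || first_named_ancestor(info) || initial_call(info) || type
  end

  # :logger_olp is folded together with the handler id (its registered name).
  def refine(:logger_olp, info) do
    case registered_name(info) do
      nil -> :logger_olp
      handler_id -> "logger_olp:#{handler_id}"
    end
  end

  # Every other type passes through unchanged.
  def refine(type, _info), do: type

  # An unregistered process reports [] for :registered_name, which is not an atom.
  defp registered_name(info) do
    case info[:registered_name] do
      name when is_atom(name) and not is_nil(name) -> name
      _ -> nil
    end
  end

  # $ancestors can hold pids as well as names; only atoms keep cardinality low.
  defp first_named_ancestor(info) do
    ancestors = get_in(info, [:dictionary, :"$ancestors"]) || []
    Enum.find(ancestors, &(is_atom(&1) and not is_nil(&1)))
  end

  defp initial_call(info) do
    case info[:initial_call] do
      {m, f, a} -> Exception.format_mfa(m, f, a)
      _ -> nil
    end
  end
end
```

Under these assumptions, `RefineSketch.refine(:logger_olp, registered_name: :default)` yields `"logger_olp:default"`, while a `nil` info leaves the coarse type untouched.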

All other process_type values are unchanged. There is no separate process_subtype attribute — the refinement happens in place, keeping cardinality low (registered names + MFAs only; no pids, no dynamic registry tuples) while giving the drill-down those investigations needed.

Affected events: vm.monitor.long_gc, vm.monitor.long_schedule, vm.monitor.long_message_queue, process.memory, process.bin_memory.

Related issues

  • Source task: electric-sql/alco-agent-tasks#46
  • SRE motivation / investigation: electric-sql/alco-agent-tasks#45 — long-mailbox / overload investigations where the coarse process_type buckets (supervisor, erlang, logger_olp) hid the specific processes responsible.

Implementation notes

  • ElectricTelemetry.Processes.proc_type/1 now refines the coarse type in a single Process.info/2 call (which additionally fetches :registered_name). The previous proc_type_and_subtype/1 and proc_subtype/2 are removed.
  • sorted_groups/2 groups by type alone again.
  • :process_subtype is removed from the tags: lists and the event metadata in application_telemetry.ex and system_monitor.ex.

Test plan

  • Unit tests for each refinement bucket and its fallback paths (packages/electric-telemetry/test/electric/telemetry/processes_test.exs).
  • Full electric-telemetry suite passes (121/121).
  • Changeset updated: @core/electric-telemetry: minor.

🤖 Generated with Claude Code

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 73.52941% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.68%. Comparing base (d921a9f) to head (586d668).
⚠️ Report is 61 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...telemetry/lib/electric/telemetry/system_monitor.ex | 27.27% | 8 Missing ⚠️ |
| ...tric-telemetry/lib/electric/telemetry/processes.ex | 95.65% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4397       +/-   ##
===========================================
+ Coverage   37.04%   56.68%   +19.63%     
===========================================
  Files         217      373      +156     
  Lines       17094    39676    +22582     
  Branches     5762    10976     +5214     
===========================================
+ Hits         6333    22491    +16158     
- Misses      10746    17114     +6368     
- Partials       15       71       +56     

| Flag | Coverage Δ |
| --- | --- |
| electric-telemetry | 71.42% <73.52%> (?) |
| elixir | 71.42% <73.52%> (?) |
| packages/agents | 70.75% <ø> (-0.41%) ⬇️ |
| packages/agents-mcp | 77.54% <ø> (?) |
| packages/agents-mobile | 66.92% <ø> (-18.50%) ⬇️ |
| packages/agents-runtime | 79.99% <ø> (?) |
| packages/agents-server | 73.98% <ø> (-1.17%) ⬇️ |
| packages/agents-server-ui | 6.21% <ø> (+0.75%) ⬆️ |
| packages/electric-ax | 46.42% <ø> (ø) |
| packages/experimental | 87.73% <ø> (?) |
| packages/react-hooks | 86.48% <ø> (?) |
| packages/start | 82.83% <ø> (?) |
| packages/typescript-client | 91.83% <ø> (?) |
| packages/y-electric | 56.05% <ø> (?) |
| typescript | 56.46% <ø> (+19.41%) ⬆️ |
| unit-tests | 56.68% <73.52%> (+19.63%) ⬆️ |

claude Bot commented May 22, 2026

Claude Code Review

Summary

Iteration 8. No new commits have landed since iteration 7 — HEAD still merges 586d6685e, and processes.ex / system_monitor.ex are byte-for-byte what iteration 7 reviewed. The only new activity is @alco's confirmation that the subtype was folded back into process_type (no new metric tag — the PR refines existing values in place). Nothing to re-review on the code; the PR remains ready, both open items are cosmetic/pre-existing and non-blocking.

What's Working Well

  • The design pivot is settled and matches the code. @alco's comment ("folded the subtype back into process_type … doesn't add a new metric tag, it simply refines the values") lines up with the tree: refine_type/2 (processes.ex:157-172) refines only :erlang/:supervisor/:logger_olp and sorted_groups/2 groups by type alone. No process_subtype tag is emitted anywhere.
  • Stringification is uniform across all five events. Every vm.monitor.* site emits process_type: to_string(type) (system_monitor.ex:46,58,88,134) and the process.memory / process.bin_memory paths do the same, so the tag value is always a binary regardless of which refine_type/2 branch fired.
  • Refinement stays low-cardinality and pid-safe. Values come from registered names, the first named $ancestor (guarded by is_atom(name) and not is_nil(name), processes.ex:181), and initial_call MFA strings — never pids or dynamic registry tuples. Everything derives from the single Process.info/2 call, and a just-exited pid degrades cleanly to :dead via Access on nil.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

MFA-string values keep a leading colon (carry-over, cosmetic)

File: processes.ex:197

Exception.format_mfa(:erlang, :apply, 2) -> ":erlang.apply/2", so an anonymous process surfaces as process_type = ":erlang.apply/2" (leading colon retained). This is pinned by the tests (processes_test.exs:54,59,64,93,170), so it reads as intentional — dashboard filters just need to include the colon. Stripping it (erlang.apply/2) would read more naturally in Honeycomb/Prometheus, but this is purely cosmetic and effectively accepted.
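
As an illustration of the cosmetic alternative mentioned above, the leading colon could be trimmed after formatting. This is only a sketch of the idea; the merged code keeps the colon, and the variable names here are made up for the example.

```elixir
# Erlang modules are atoms without the Elixir. prefix, so they format with a colon.
mfa = Exception.format_mfa(:erlang, :apply, 2)

# Hypothetical post-processing step: drop the leading colon for display.
stripped = String.trim_leading(mfa, ":")

IO.inspect({mfa, stripped})
# prints {":erlang.apply/2", "erlang.apply/2"}
```

Elixir module names such as `Enum` carry no leading colon, so the trim would leave them unchanged.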

Dead pids can linger in long_message_queue_pids (pre-existing, minor)

File: system_monitor.ex:111-129

A pid is removed from the set only on a {:monitor, pid, :long_message_queue, false} message. If a monitored process crashes while over threshold, no false arrives, so its pid stays in the MapSet and :recheck_message_queues keeps the 200ms timer alive, re-resolving proc_type/1 on a dead pid each tick (harmless — log_long_message_queue_event/2 skips emission when Process.info(pid, :message_queue_len) returns nil). Unchanged from the prior map-based implementation, so not a regression; a Process.alive?/1 filter (or dropping pids when proc_type returns :dead) would let the timer wind down. Bounded and low-impact; non-blocking.
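
A rough sketch of the `Process.alive?/1` filter suggested above, assuming the tracked pids live in a `MapSet` as the review describes. `PruneSketch.prune_dead/1` is a hypothetical helper, not part of system_monitor.ex, and `MapSet.filter/2` requires Elixir 1.14 or later.

```elixir
defmodule PruneSketch do
  # Hypothetical helper: keep only pids whose process is still running.
  def prune_dead(pids), do: MapSet.filter(pids, &Process.alive?/1)
end

# Spawn a short-lived process and wait until it has definitely exited.
dead = spawn(fn -> :ok end)
ref = Process.monitor(dead)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

tracked = MapSet.new([self(), dead])
pruned = PruneSketch.prune_dead(tracked)
```

Once the set is empty after pruning, a recheck loop built this way would have nothing left to reschedule for.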

Issue Conformance

The implementation matches the PR description and @alco's confirmed approach: in-place refinement of the three coarse buckets via registered name -> first named $ancestor -> initial_call MFA, :logger_olp folded to logger_olp:<handler_id>, all other types unchanged, no separate attribute, across the five named events. Linked tasks (electric-sql/alco-agent-tasks#46, #45) are present. Changeset @core/electric-telemetry: minor is included and accurate (only its filename, process-subtype-attribute.md, still echoes the abandoned approach — harmless).

Previous Review Status

No code changed since iteration 7, so both prior conclusions stand:

  • Resolved earlier: stringification consistency (586d6685e), dead-process test pinning, and stale-subtype-on-recheck re-resolution (7026b78a9).
  • Still open (both non-blocking): leading-colon MFA rendering (cosmetic, test-pinned) and dead-pid lingering in the message-queue set (pre-existing).

Review iteration: 8 | 2026-06-09

claude Bot commented May 22, 2026

Claude Code Review

Summary

Iteration 7. Only two commits landed since iteration 6 (864103a43): 940ba9c76 reverts the formatting churn in application_telemetry.ex so it matches main exactly and trims the proc_type/1 docstring, and 9d0f692e9 drops the explanatory comment above the :erlang/:supervisor refine_type/2 clause. Both are pure documentation/whitespace changes with no behavioral impact — the in-place process_type refinement reviewed at iteration 6 is unchanged. No blocking issues.

What's Working Well

  • The formatting revert is a genuine improvement. With the process_subtype tag gone, application_telemetry.ex no longer needed the multi-line reflows, and 940ba9c76 correctly returns it to match main — keeping the PR diff scoped to the actual behavior change.
  • The dead-process contract is test-pinned (correcting iteration 6, see Previous Review Status). processes_test.exs:105-120 (proc_type/1 for dead processes) asserts :dead for an exited pid, and refine_type/2's catch-all passes :dead through unchanged. Combined with sorted_groups/2 rejecting :dead, the contract is intact and covered.
  • nil info is handled safely on the hot path. When Process.info/2 returns nil for a just-exited pid, nil[:label] / get_in(nil, …) / info[:registered_name] all degrade to nil via Access, so proc_type/1 resolves to :dead and log_long_message_queue_event/3's with {:message_queue_len, _} <- guard skips emission. No crash on the recheck loop racing a dying process.
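
The nil-handling described above rests on two standard behaviours: `Process.info/2` returns `nil` for a process that has exited, and `Access` lookups on `nil` return `nil`. A small standalone sketch (variable names are illustrative only):

```elixir
# Spawn a process and wait for it to exit so the pid is known to be dead.
pid = spawn(fn -> :ok end)
ref = Process.monitor(pid)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

# Process.info/2 returns nil for a dead pid instead of raising.
info = Process.info(pid, [:registered_name, :dictionary])

# Access on nil cascades nils rather than crashing.
name = info[:registered_name]
ancestors = get_in(info, [:dictionary, :"$ancestors"])

IO.inspect({info, name, ancestors})
```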

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None.

Suggestions (Nice to Have)

proc_type/1 fetches expensive :binary introspection on already-overloaded processes every 200 ms

Files: processes.ex:143-146 + system_monitor.ex:118-129

info/1 fetches [:dictionary, :initial_call, :label, :memory, :binary, :registered_name], but type resolution only needs :dictionary, :initial_call, :label, and :registered_name; :memory and :binary are consumed solely by type_and_memory/1. Process.info(pid, :binary) walks the process heap to enumerate every refc binary the process references and is explicitly documented as potentially expensive.

The iteration-6 redesign changed :recheck_message_queues to re-resolve proc_type(pid) fresh every 200 ms (previously it replayed a cached type with no info call). That trade buys subtype accuracy — a real improvement — but it now triggers the full :binary introspection every 200 ms on exactly the processes already drowning in their mailboxes, which is where that introspection is most costly. A lighter-weight info fetch for the proc_type/1-only callers (dropping :memory/:binary) would keep the accuracy win without adding heap-walking load to processes already in distress. Pre-existing for the per-event paths, newly per-tick for the recheck loop; non-blocking either way.
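
A lighter-weight fetch along the lines suggested above might look like the sketch below. `LightInfoSketch` and `type_info/1` are hypothetical names, not functions in processes.ex. The item list also leaves out `:label`, which the review lists, on the assumption that it is only available on recent OTP releases; it could be added back where supported.

```elixir
defmodule LightInfoSketch do
  # Hypothetical item list: only what type resolution reads.
  # :memory and :binary are deliberately left out.
  @type_items [:dictionary, :initial_call, :registered_name]

  # Returns a keyword list for a live pid, or nil if the process has exited.
  def type_info(pid), do: Process.info(pid, @type_items)
end

info = LightInfoSketch.type_info(self())
IO.inspect(Keyword.keys(info))
```

Callers that also need memory figures would keep using the full fetch, so the two item lists stay separate.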

Stringification still differs between the two emission paths (carry-over)

Files: application_telemetry.ex:185,201 vs system_monitor.ex:45-47,57-59,87-89,133-135

process.memory / process.bin_memory emit process_type: to_string(map.type) (always a binary), while the vm.monitor.* events pass process_type: type raw — an atom() | binary() union since refine_type/2 can return either. Reporters stringify tag values, so nothing breaks, but wrapping the system_monitor values in to_string/1 would make the tag type predictable across all five events. Non-blocking.

MFA-string values keep a leading colon (carry-over, cosmetic)

File: processes.ex:186-202

Exception.format_mfa(:erlang, :apply, 2) -> ":erlang.apply/2" (leading colon retained), so a fallback process surfaces as process_type = ":erlang.apply/2" and dashboard filters must include the colon. Stripping the leading : would read more naturally; cosmetic only.

refine_type/2 rationale comment was removed asymmetrically (minor)

File: processes.ex:157-170

9d0f692e9 removed the comment explaining why :erlang/:supervisor are refined, but the :logger_olp clause directly below kept its explanatory comment. A reader landing on the first clause now has no inline rationale while the second does. The changeset and docstring still cover the intent, so this is purely an in-file readability nit — a one-line comment would restore symmetry. Optional.

Issue Conformance

The PR description matches the implementation: the in-place refinement of the three coarse buckets, the registered_name → $ancestors → initial_call fallback chain, the logger_olp:<handler_id> folding, and the five affected events are all accurate. No separate process_subtype attribute is added. Linked tasks (electric-sql/alco-agent-tasks#46, #45) are present. Changeset (@core/electric-telemetry: minor) is present and its text describes the in-place design correctly; only the changeset filename (process-subtype-attribute.md) still reflects the abandoned approach — harmless.

Previous Review Status

Correcting iteration 6: the "Dead-process contract is no longer test-pinned" item was inaccurate. The proc_type/1 for dead processes describe block exists at processes_test.exs:105-120 and asserts :dead; the test file was not touched by either commit since iteration 6, so it was present at 864103a43 as well. No action needed there.

Still open (all non-blocking):

  • proc_type/1 fetching :binary on the 200 ms recheck loop (new this iteration, see above).
  • Stringification inconsistency between the two emission paths (carry-over).
  • Leading-colon MFA rendering (carry-over, cosmetic).
  • refine_type/2 comment asymmetry from 9d0f692e9 (new, minor).

Resolved earlier by the redesign (no change this iteration): the process_subtype: nil rendering concern and the companion-attribute complexity are gone; the stale-subtype-on-recheck concern is resolved by re-resolving proc_type(pid) fresh each tick (at the small cost noted above).


Review iteration: 7 | 2026-06-09

Adds a new low-cardinality `process_subtype` attribute alongside the
existing `process_type` on all telemetry events that today carry it
(`vm.monitor.long_{gc,schedule,message_queue}`, `process.memory`,
`process.bin_memory`).

For the three coarse `process_type` buckets that previously hid most
of the signal during overload, `process_subtype` is derived as:

  * `:supervisor`  -> registered name, else first atom in $ancestors
  * `:erlang`      -> registered name, else initial_call MFA string
  * `:logger_olp`  -> registered name (handler id)

For every other `process_type` value, `process_subtype` is `nil`.
The existing `process_type` taxonomy is unchanged, so Honeycomb boards
and alerts that group by it continue to work; `process_subtype` adds
a finer-grained drill-down without exploding cardinality.

Refs electric-sql/alco-agent-tasks#46.

term() is correct but uninformative: in practice the type is always
atom() | binary() (atoms cover :dead, :unknown, module atoms, and atom
labels; binaries cover the string-label case). Helps dialyzer downstream.

…ut nil

is_atom(nil) is true, so the previous clause order (`nil -> nil` before
the atom guard) was load-bearing — dropping it would silently turn into
`"nil"`. Rewrite the guard to match on `is_atom(name) and not is_nil(name)`
so the clause stands on its own.

…_subtype/1

Existing tests cover proc_type/1 returning :dead for an exited process,
but proc_type_and_subtype/1 and proc_subtype/1 weren't exercised for
that case. Implementation relies on Process.info/2 returning nil and
Access on nil cascading nils through every helper; lock that contract
down so a future refactor of info/1 doesn't silently change the answer.

@erik-the-implementer erik-the-implementer force-pushed the erik/process-subtype-attribute branch from 1bd5c76 to fef04de Compare June 1, 2026 13:34
@erik-the-implementer erik-the-implementer force-pushed the erik/process-subtype-attribute branch from 3c9935c to d780505 Compare June 1, 2026 15:03
@alco alco self-assigned this Jun 2, 2026

@robacourt robacourt left a comment

Contributor

proc_type already comes from a variety of places (the process label or, if there is none, initial_module), so it seems natural to extend that to check other places if the result we get back is not helpful. For example, I don't think it's ever useful to have erlang as the proc_type, and we could fall back to what you have as the subtype in that case. Same for supervisor.

The strong argument for a proc_sub_type would come from {"logger_olp", handler id} if there's an infinite amount of handler_ids, and if it's useful to group by logger_olp.

So I can potentially see the need for proc_subtype, but personally I wouldn't use it for erlang and supervisor because it's not particularly useful to group by erlang or supervisor and there should only be a limited number of the names we'd replace it with.

Or maybe we shouldn't have subtype at all. What are the handler ids? Are they limited and readable?

Per PR review feedback: the coarse :erlang and :supervisor types are too
generic to group by on their own, so refine them in-place to the registered
name / named $ancestor / initial-call MFA. Logger handler processes become
logger_olp:<handler_id>. Drops the separate process_subtype attribute and its
tags from all affected telemetry events.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erik-the-implementer erik-the-implementer changed the title feat(electric-telemetry): add process_subtype attribute for supervisor/erlang/logger_olp granularity feat(electric-telemetry): refine coarse process_type buckets in place Jun 9, 2026
application_telemetry.ex now matches main exactly (the process_subtype tags
are gone, so the multi-line reflows were pure churn). Also drop the
refinement-detail paragraph from proc_type/1's docstring.
Makes the system_monitor events consistent with application_telemetry, which
already emits to_string(map.type). The tag value is now predictably a binary
at every emission site rather than an atom|binary union.
alco commented Jun 9, 2026

Member

@robacourt good call. I've folded the subtype back into process_type. So this PR doesn't add a new metric tag, it simply refines the values for some process types to be more descriptive.

netlify Bot commented Jun 9, 2026


Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 864103a
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a27e4c89a81990008a49740
😎 Deploy Preview https://deploy-preview-4397--electric-next.netlify.app
@robacourt robacourt left a comment

Contributor

Very nice!

@alco alco merged commit c0ebd51 into main Jun 10, 2026
78 of 79 checks passed
@alco alco deleted the erik/process-subtype-attribute branch June 10, 2026 12:20
3 participants