Skip to content

flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment#761

Open
Oppen wants to merge 5 commits into
mainfrom
feat/flamegraph_enhancements
Open

flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment#761
Oppen wants to merge 5 commits into
mainfrom
feat/flamegraph_enhancements

Conversation

@Oppen

@Oppen Oppen commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Fix a correctness bug where any dst=0 jump (loop back-edges, if/else,
    jump tables, self-tail-recursion) was unconditionally treated as a tail
    call, corrupting the tracked call stack. Now compares the containing
    function of current_pc vs next_pc via SymbolTable::lookup; only
    mutates the stack on a genuine cross-function tail call.
  • Replace the String-keyed HashMap<String, u64> fold with a call-graph
    trie (TrieNode arena, O(1) push/pop/count independent of stack depth).
    Symbol resolution/demangling is deferred to write_folded, memoized per
    address, instead of running per instruction — this is what makes
    flamegraph generation viable at all on real workloads (see Benchmark).
  • Add a raw hex-address-keyed output mode (--flamegraph-raw) plus a
    standalone script (scripts/enrich_flamegraph.py) that drives the system
    addr2line binary (one process, one pipe) to expand inlined call chains
    and resolve file:line, with no new crate dependency — the executor never
    shells out itself.
  • Extract the CLI's execute+flamegraph drive loop into
    executor::flamegraph (run_with_flamegraph/drive_with_flamegraph),
    shared between the CLI and future callers, with --cycle-budget and
    --flamegraph-checkpoint-cycles.
  • Enable debug = true for the recursion-bench guest so addr2line has
    DWARF to resolve against.

Benchmark

cli execute recursion.elf --private-input <blob>, 73-FRI-query
verification of an empty inner proof. /usr/bin/time -l, same machine,
same guest ELF and input across each pair of rows, only the CLI rebuilt.
"debug-recursion" = guest ELF built with debug = true (this PR); "no-debug"
= guest ELF built without it (main's own build config).

Execution only Flamegraph Overhead
This branch, debug-recursion 192.41s / 3.495 GB 222.40s / 3.517 GB +15.6% time / +0.63% mem
This branch, no-debug 195.75s / 3.436 GB 222.78s / 3.437 GB +13.8% time / +0.05% mem
main, debug-recursion 194.64s / 3.496 GB >10,493s (killed, incomplete) >52.9x time
main, no-debug 192.99s / 3.436 GB >7,139s (killed, incomplete) >36.0x time

main's fold calls format_stack() (demangles every frame on the call
stack) on every instruction, uncached. Both main runs were killed
without finishing, each after multiple hours, while this branch completes
the identical run in ~222s — the numbers above are lower bounds, not final
ones. This branch's flamegraph mode is at least 32-47x faster than
main's for the identical guest binary and input.

Peak RSS for the two killed main runs (0.95 GB / 0.44 GB) is not
comparable to the completed rows: RSS was still climbing linearly at kill
time (each run had processed far fewer guest instructions in its ~2-3
hours than the ~200s execution-only baseline processes in full), so it
understates main's true peak rather than showing it using less memory.

Flamegraph examples (branch, debug-recursion.elf):

Test plan

  • make test-executor: 166 asm + 20 flamegraph + 29 rust tests, green
  • Regression tests: intra-function jal/jalr x0, cross-function
    mutual tail calls, self-tail-recursion, dst=1 recursion unaffected,
    zero-size-symbol boundary pinned, raw-mode keys by address not name
  • cargo clippy/cargo fmt clean on executor and cli
  • --flamegraph-rawscripts/enrich_flamegraph.py verified
    end-to-end against a real cross-compiled RISC-V guest ELF with
    llvm-addr2line
  • Benchmarked against main on the same machine, same input

…fold

- Fix bug where any dst=0 jump (loop back-edges, if/else, jump tables,
  self-tail-recursion) was misattributed as a tail call, corrupting the
  tracked stack. Now compares the containing function of current_pc vs
  next_pc via SymbolTable::lookup before mutating the stack.
- Replace the per-cycle String-keyed HashMap fold with a call-graph trie
  (address-keyed, O(1) push/pop/count). Symbol resolution and demangling
  are postponed to write_folded, memoized per unique address, instead of
  running on every instruction.
- Add a raw (unresolved, hex-address-keyed) output mode plus an external
  Python script (scripts/enrich_flamegraph.py) that drives addr2line for
  inline-chain + file:line resolution, without adding a crate dependency.
- Extract the CLI's execute+flamegraph drive loop into
  executor::flamegraph (run_with_flamegraph/drive_with_flamegraph), shared
  by the CLI and any future caller, with a cycle budget and periodic
  checkpoint support.
- Enable debug info for the recursion-bench guest so addr2line resolution
  has DWARF to work with.

Sampling was prototyped and removed: in the trie design the only thing it
skips is a cheap integer increment, not the (already O(1)) stack push/pop,
so it bought neither time nor memory. A per-address control-flow
classification cache was also prototyped and reverted after measurement
showed it made things slightly slower, not faster.
@Oppen Oppen marked this pull request as draft July 1, 2026 20:30
Oppen added 2 commits July 1, 2026 18:03
Benchmarking against main (73-query recursion-guest verification, ~22.66B
cycles) showed the trie-based fold was only marginally faster than the
original String-keyed HashMap fold (14.0% vs 15.9% overhead over a
no-flamegraph baseline) — not the substantive win the rework was meant to
be. Reverting it: back to call_stack: Vec<u64> + stack_counts:
HashMap<String, u64>, resolving/demangling inline in process_logs as
before.

Kept:
- The tail-call misattribution fix (unrelated to trie vs HashMap).
- Raw-address output mode, now chosen at construction
  (new/new_raw) since the stack key is formatted eagerly per log again,
  rather than deferred to write time.
- The shared execute+flamegraph drive loop and its cycle-budget/checkpoint
  support, and the InstructionCache clone-once fix in
  drive_with_flamegraph — unrelated to trie vs HashMap and measurably
  reduced overhead on its own.
- scripts/enrich_flamegraph.py and the recursion-bench debug=true change.
@Oppen

Oppen commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Regular flamegraph:
flamegraph_regular

Enriched flamegraph (raw output + addr2line script):
flamegraph_raw_enriched

@Oppen Oppen marked this pull request as ready for review July 2, 2026 17:59
Oppen added 2 commits July 3, 2026 15:49
debug=true and debug=2 are equivalent, but explicit avoids ambiguity
when profiling the guest ELF with tools that expect full debug info.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants