flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment #761
Open
Oppen wants to merge 5 commits into
…fold

- Fix bug where any dst=0 jump (loop back-edges, if/else, jump tables, self-tail-recursion) was misattributed as a tail call, corrupting the tracked stack. Now compares the containing function of current_pc vs next_pc via SymbolTable::lookup before mutating the stack.
- Replace the per-cycle String-keyed HashMap fold with a call-graph trie (address-keyed, O(1) push/pop/count). Symbol resolution and demangling are postponed to write_folded, memoized per unique address, instead of running on every instruction.
- Add a raw (unresolved, hex-address-keyed) output mode plus an external Python script (scripts/enrich_flamegraph.py) that drives addr2line for inline-chain + file:line resolution, without adding a crate dependency.
- Extract the CLI's execute+flamegraph drive loop into executor::flamegraph (run_with_flamegraph/drive_with_flamegraph), shared by the CLI and any future caller, with a cycle budget and periodic checkpoint support.
- Enable debug info for the recursion-bench guest so addr2line resolution has DWARF to work with.

Sampling was prototyped and removed: in the trie design the only thing it skips is a cheap integer increment, not the (already O(1)) stack push/pop, so it bought neither time nor memory. A per-address control-flow classification cache was also prototyped and reverted after measurement showed it made things slightly slower, not faster.
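A minimal sketch of the trie-based fold described in this commit, for readers who want to see why push/pop/count stay O(1). The names TrieNode and write_folded come from the PR text; the field layout, the StackTrie wrapper and the resolve callback are assumptions made for illustration and are not taken from the actual diff.

```rust
use std::collections::HashMap;

/// One node per unique call path; `addr` is the frame at the end of that path.
struct TrieNode {
    addr: u64,
    parent: usize,
    children: HashMap<u64, usize>, // callee address -> arena index
    count: u64,                    // cycles attributed to exactly this stack
}

/// Arena-backed call-graph trie (hypothetical wrapper type).
/// `current` is the node that represents the live call stack.
pub struct StackTrie {
    nodes: Vec<TrieNode>,
    current: usize,
}

impl StackTrie {
    pub fn new() -> Self {
        let root = TrieNode { addr: 0, parent: 0, children: HashMap::new(), count: 0 };
        StackTrie { nodes: vec![root], current: 0 }
    }

    /// Enter a callee: follow the existing child edge or create a new node.
    pub fn push(&mut self, addr: u64) {
        let cur = self.current;
        if let Some(&idx) = self.nodes[cur].children.get(&addr) {
            self.current = idx;
            return;
        }
        let idx = self.nodes.len();
        self.nodes.push(TrieNode { addr, parent: cur, children: HashMap::new(), count: 0 });
        self.nodes[cur].children.insert(addr, idx);
        self.current = idx;
    }

    /// Return from the current frame; popping at the root is a no-op.
    pub fn pop(&mut self) {
        self.current = self.nodes[self.current].parent;
    }

    /// Attribute one cycle to the live stack: a single integer increment.
    pub fn count(&mut self) {
        self.nodes[self.current].count += 1;
    }

    /// Emit folded lines ("a;b;c N"), calling `resolve` once per unique address.
    /// Cycles counted on the empty stack (the root) are skipped in this sketch.
    pub fn write_folded(&self, resolve: impl Fn(u64) -> String) -> Vec<String> {
        let mut names: HashMap<u64, String> = HashMap::new();
        let mut out = Vec::new();
        for (idx, node) in self.nodes.iter().enumerate().skip(1) {
            if node.count == 0 {
                continue;
            }
            // Walk parent links back to the root, then reverse to caller-first order.
            let mut frames = Vec::new();
            let mut i = idx;
            while i != 0 {
                let a = self.nodes[i].addr;
                frames.push(names.entry(a).or_insert_with(|| resolve(a)).clone());
                i = self.nodes[i].parent;
            }
            frames.reverse();
            out.push(format!("{} {}", frames.join(";"), node.count));
        }
        out
    }
}
```

Passing a resolver such as `|a| format!("0x{:x}", a)` gives the raw, address-keyed output; passing a demangling lookup gives named frames. In both cases the per-instruction path never touches a String.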
Benchmarking against main (73-query recursion-guest verification, ~22.66B cycles) showed the trie-based fold was only marginally faster than the original String-keyed HashMap fold (14.0% vs 15.9% overhead over a no-flamegraph baseline), not the substantive win the rework was meant to be. Reverting it: back to call_stack: Vec<u64> + stack_counts: HashMap<String, u64>, resolving/demangling inline in process_logs as before.

Kept:

- The tail-call misattribution fix (unrelated to trie vs HashMap).
- Raw-address output mode, now chosen at construction (new/new_raw) since the stack key is formatted eagerly per log again, rather than deferred to write time.
- The shared execute+flamegraph drive loop and its cycle-budget/checkpoint support, and the InstructionCache clone-once fix in drive_with_flamegraph, which is unrelated to trie vs HashMap and measurably reduced overhead on its own.
- scripts/enrich_flamegraph.py and the recursion-bench debug=true change.
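For comparison, here is a rough sketch of the eager fold this commit goes back to. The field names call_stack and stack_counts and the new/new_raw constructors are named in the commit message; the Flamegraph struct name, record_cycle, count_of and the resolve callback are hypothetical stand-ins, and the real process_logs may be organised differently.

```rust
use std::collections::HashMap;

/// Eager fold: the stack key is formatted on every recorded cycle.
pub struct Flamegraph {
    call_stack: Vec<u64>,
    stack_counts: HashMap<String, u64>,
    raw: bool, // fixed at construction because keys are built eagerly
}

impl Flamegraph {
    /// Resolved mode: frames are keyed by symbol name.
    pub fn new() -> Self {
        Flamegraph { call_stack: Vec::new(), stack_counts: HashMap::new(), raw: false }
    }

    /// Raw mode: frames are keyed by hex address for later enrichment.
    pub fn new_raw() -> Self {
        Flamegraph { call_stack: Vec::new(), stack_counts: HashMap::new(), raw: true }
    }

    pub fn push(&mut self, addr: u64) {
        self.call_stack.push(addr);
    }

    pub fn pop(&mut self) {
        self.call_stack.pop();
    }

    /// Build the "a;b;c" key for the live stack and bump its counter.
    /// `resolve` is a hypothetical symbol lookup; it is ignored in raw mode.
    pub fn record_cycle(&mut self, resolve: impl Fn(u64) -> String) {
        let key = self
            .call_stack
            .iter()
            .map(|&a| if self.raw { format!("0x{:x}", a) } else { resolve(a) })
            .collect::<Vec<_>>()
            .join(";");
        *self.stack_counts.entry(key).or_insert(0) += 1;
    }

    /// Cycles recorded for one folded stack key (0 if never seen).
    pub fn count_of(&self, key: &str) -> u64 {
        self.stack_counts.get(key).copied().unwrap_or(0)
    }
}
```

Because the key is a String built per cycle, the output mode cannot be switched at write time, which is why the commit moves the choice into the constructor.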
…features" This reverts commit 114ad57.
diegokingston
approved these changes
Jul 3, 2026
debug=true and debug=2 are equivalent, but explicit avoids ambiguity when profiling the guest ELF with tools that expect full debug info.
Summary
- Fix: any dst=0 jump (loop back-edges, if/else, jump tables, self-tail-recursion) was unconditionally treated as a tail call, corrupting the tracked call stack. Now compares the containing function of current_pc vs next_pc via SymbolTable::lookup; only mutates the stack on a genuine cross-function tail call.
- Replace the per-cycle String-keyed HashMap<String, u64> fold with a call-graph trie (TrieNode arena, O(1) push/pop/count independent of stack depth). Symbol resolution/demangling is deferred to write_folded, memoized per address, instead of running per instruction; this is what makes flamegraph generation viable at all on real workloads (see Benchmark).
- Add a raw output mode (--flamegraph-raw) plus a standalone script (scripts/enrich_flamegraph.py) that drives the system addr2line binary (one process, one pipe) to expand inlined call chains and resolve file:line, with no new crate dependency; the executor never shells out itself.
- Extract the CLI's execute+flamegraph drive loop into executor::flamegraph (run_with_flamegraph / drive_with_flamegraph), shared between the CLI and future callers, with --cycle-budget and --flamegraph-checkpoint-cycles.
- Enable debug = true for the recursion-bench guest so addr2line has DWARF to resolve against.
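The tail-call check in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in the diff: the SymbolTable here is a simplified stand-in built from sorted (start, size) ranges, its lookup signature is a guess, and replacing the top frame on a tail call is one plausible way to mutate the stack.

```rust
/// Simplified stand-in for the executor's symbol table (assumed layout).
pub struct SymbolTable {
    funcs: Vec<(u64, u64)>, // (start address, size in bytes), sorted by start
}

impl SymbolTable {
    pub fn new(mut funcs: Vec<(u64, u64)>) -> Self {
        funcs.sort_unstable();
        SymbolTable { funcs }
    }

    /// Start address of the function containing `pc`, or None if `pc`
    /// falls outside every known symbol.
    pub fn lookup(&self, pc: u64) -> Option<u64> {
        // Index of the first symbol that starts after `pc`.
        let idx = self.funcs.partition_point(|&(start, _)| start <= pc);
        let (start, size) = *self.funcs.get(idx.checked_sub(1)?)?;
        if pc - start < size {
            Some(start)
        } else {
            None
        }
    }
}

/// A dst=0 jump (link register discarded, e.g. jal/jalr x0) counts as a tail
/// call only when it lands in a different function than the one it left.
pub fn is_tail_call(symbols: &SymbolTable, current_pc: u64, next_pc: u64) -> bool {
    match (symbols.lookup(current_pc), symbols.lookup(next_pc)) {
        (Some(from), Some(to)) => from != to,
        // Unknown region: leave the stack alone (an assumption of this sketch).
        _ => false,
    }
}

/// Apply a dst=0 jump to the tracked stack. Loop back-edges, if/else branches,
/// jump tables and self-tail-recursion stay inside one function and are ignored.
pub fn on_dst0_jump(stack: &mut Vec<u64>, symbols: &SymbolTable, current_pc: u64, next_pc: u64) {
    if is_tail_call(symbols, current_pc, next_pc) {
        if let Some(target) = symbols.lookup(next_pc) {
            // The caller's frame is replaced, not nested under.
            stack.pop();
            stack.push(target);
        }
    }
}
```

With two functions at 0x1000 and 0x2000, a jump from 0x1010 back to 0x1004 leaves the stack untouched, while a jump from 0x1010 to 0x2000 swaps the top frame.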
Benchmark
cli execute recursion.elf --private-input <blob>, 73-FRI-query verification of an empty inner proof. /usr/bin/time -l, same machine, same guest ELF and input across each pair of rows, only the CLI rebuilt. "debug-recursion" = guest ELF built with debug = true (this PR); "no-debug" = guest ELF built without it (main's own build config).

[Results table; surviving row labels: "main, debug-recursion", "main, no-debug".]

main's fold calls format_stack() (demangles every frame on the call stack) on every instruction, uncached. Both main runs were killed without finishing, each after multiple hours, while this branch completes the identical run in ~222s; the numbers above are lower bounds, not final ones. This branch's flamegraph mode is at least 32-47x faster than main's for the identical guest binary and input.

Peak RSS for the two killed main runs (0.95 GB / 0.44 GB) is not comparable to the completed rows: RSS was still climbing linearly at kill time (each run had processed far fewer guest instructions in its ~2-3 hours than the ~200s execution-only baseline processes in full), so it understates main's true peak rather than showing it using less memory.

Flamegraph examples (branch, debug-recursion.elf):
Test plan
- make test-executor: 166 asm + 20 flamegraph + 29 rust tests, green
- jal/jalr x0, cross-function mutual tail calls, self-tail-recursion, dst=1 recursion unaffected, zero-size-symbol boundary pinned, raw-mode keys by address not name
- cargo clippy / cargo fmt clean on executor and cli
- --flamegraph-raw → scripts/enrich_flamegraph.py verified end-to-end against a real cross-compiled RISC-V guest ELF with llvm-addr2line
- main on the same machine, same input