flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment by Oppen · Pull Request #761 · yetanotherco/lambda_vm

Oppen · 2026-07-01T20:19:14Z

Summary

Fix a correctness bug where any dst=0 jump (loop back-edges, if/else,
jump tables, self-tail-recursion) was unconditionally treated as a tail
call, corrupting the tracked call stack. Now compares the containing
function of current_pc vs next_pc via SymbolTable::lookup; only
mutates the stack on a genuine cross-function tail call.
Replace the String-keyed HashMap<String, u64> fold with a call-graph
trie (TrieNode arena, O(1) push/pop/count independent of stack depth).
Symbol resolution/demangling is deferred to write_folded, memoized per
address, instead of running per instruction — this is what makes
flamegraph generation viable at all on real workloads (see Benchmark).
Add a raw hex-address-keyed output mode (--flamegraph-raw) plus a
standalone script (scripts/enrich_flamegraph.py) that drives the system
addr2line binary (one process, one pipe) to expand inlined call chains
and resolve file:line, with no new crate dependency — the executor never
shells out itself.
Extract the CLI's execute+flamegraph drive loop into
executor::flamegraph (run_with_flamegraph/drive_with_flamegraph),
shared between the CLI and future callers, with --cycle-budget and
--flamegraph-checkpoint-cycles.
Enable debug = true for the recursion-bench guest so addr2line has
DWARF to resolve against.

Benchmark

cli execute recursion.elf --private-input <blob>, 73-FRI-query
verification of an empty inner proof. /usr/bin/time -l, same machine,
same guest ELF and input across each pair of rows, only the CLI rebuilt.
"debug-recursion" = guest ELF built with debug = true (this PR); "no-debug"
= guest ELF built without it (main's own build config).

	Execution only	Flamegraph	Overhead
This branch, debug-recursion	192.41s / 3.495 GB	222.40s / 3.517 GB	+15.6% time / +0.63% mem
This branch, no-debug	195.75s / 3.436 GB	222.78s / 3.437 GB	+13.8% time / +0.05% mem
`main`, debug-recursion	194.64s / 3.496 GB	>10,493s (killed, incomplete)	>52.9x time
`main`, no-debug	192.99s / 3.436 GB	>7,139s (killed, incomplete)	>36.0x time

main's fold calls format_stack() (demangles every frame on the call
stack) on every instruction, uncached. Both main runs were killed
without finishing, each after multiple hours, while this branch completes
the identical run in ~222s — the numbers above are lower bounds, not final
ones. This branch's flamegraph mode is at least 32-47x faster than
main's for the identical guest binary and input.

Peak RSS for the two killed main runs (0.95 GB / 0.44 GB) is not
comparable to the completed rows: RSS was still climbing linearly at kill
time (each run had processed far fewer guest instructions in its ~2-3
hours than the ~200s execution-only baseline processes in full), so it
understates main's true peak rather than showing it using less memory.

Flamegraph examples (branch, debug-recursion.elf):

Test plan

make test-executor: 166 asm + 20 flamegraph + 29 rust tests, green
Regression tests: intra-function jal/jalr x0, cross-function
mutual tail calls, self-tail-recursion, dst=1 recursion unaffected,
zero-size-symbol boundary pinned, raw-mode keys by address not name
cargo clippy/cargo fmt clean on executor and cli
--flamegraph-raw → scripts/enrich_flamegraph.py verified
end-to-end against a real cross-compiled RISC-V guest ELF with
llvm-addr2line
Benchmarked against main on the same machine, same input

…fold - Fix bug where any dst=0 jump (loop back-edges, if/else, jump tables, self-tail-recursion) was misattributed as a tail call, corrupting the tracked stack. Now compares the containing function of current_pc vs next_pc via SymbolTable::lookup before mutating the stack. - Replace the per-cycle String-keyed HashMap fold with a call-graph trie (address-keyed, O(1) push/pop/count). Symbol resolution and demangling are postponed to write_folded, memoized per unique address, instead of running on every instruction. - Add a raw (unresolved, hex-address-keyed) output mode plus an external Python script (scripts/enrich_flamegraph.py) that drives addr2line for inline-chain + file:line resolution, without adding a crate dependency. - Extract the CLI's execute+flamegraph drive loop into executor::flamegraph (run_with_flamegraph/drive_with_flamegraph), shared by the CLI and any future caller, with a cycle budget and periodic checkpoint support. - Enable debug info for the recursion-bench guest so addr2line resolution has DWARF to work with. Sampling was prototyped and removed: in the trie design the only thing it skips is a cheap integer increment, not the (already O(1)) stack push/pop, so it bought neither time nor memory. A per-address control-flow classification cache was also prototyped and reverted after measurement showed it made things slightly slower, not faster.

Benchmarking against main (73-query recursion-guest verification, ~22.66B cycles) showed the trie-based fold was only marginally faster than the original String-keyed HashMap fold (14.0% vs 15.9% overhead over a no-flamegraph baseline) — not the substantive win the rework was meant to be. Reverting it: back to call_stack: Vec<u64> + stack_counts: HashMap<String, u64>, resolving/demangling inline in process_logs as before. Kept: - The tail-call misattribution fix (unrelated to trie vs HashMap). - Raw-address output mode, now chosen at construction (new/new_raw) since the stack key is formatted eagerly per log again, rather than deferred to write time. - The shared execute+flamegraph drive loop and its cycle-budget/checkpoint support, and the InstructionCache clone-once fix in drive_with_flamegraph — unrelated to trie vs HashMap and measurably reduced overhead on its own. - scripts/enrich_flamegraph.py and the recursion-bench debug=true change.

…features" This reverts commit 114ad57.

Oppen · 2026-07-02T17:56:05Z

Regular flamegraph:

Enriched flamegraph (raw output + addr2line script):

debug=true and debug=2 are equivalent, but explicit avoids ambiguity when profiling the guest ELF with tools that expect full debug info.

Oppen marked this pull request as draft July 1, 2026 20:30

Oppen added 2 commits July 1, 2026 18:03

Revert "revert(flamegraph): drop the trie fold, keep the bug fix and …

43af257

…features" This reverts commit 114ad57.

Oppen marked this pull request as ready for review July 2, 2026 17:59

diegokingston approved these changes Jul 3, 2026

View reviewed changes

Oppen added 2 commits July 3, 2026 15:49

chore(recursion): use explicit debug=2 for recursion guest profile

ee45989

debug=true and debug=2 are equivalent, but explicit avoids ambiguity when profiling the guest ELF with tools that expect full debug info.

unsymbolized = same

6611b5e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment#761

flamegraph: fix tail-call misattribution, trie-based fold, addr2line enrichment#761
Oppen wants to merge 5 commits into
mainfrom
feat/flamegraph_enhancements

Oppen commented Jul 1, 2026 •

edited

Loading

Uh oh!

Oppen commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Oppen commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark

Test plan

Uh oh!

Oppen commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Oppen commented Jul 1, 2026 •

edited

Loading