fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20
Draft
not-matthias wants to merge 1 commit into
Draft
fix(callgrind): correct AArch64/PPC call-return detection for SP-equal frames (COD-2985)#20not-matthias wants to merge 1 commit into
not-matthias wants to merge 1 commit into
Conversation
…l frames On AArch64/PPC the call instruction leaves SP unchanged (return address in the link register), so a callee's shadow-stack entry frame sits at SP *equal* to its return target rather than strictly below it as on x86. Three defects in the call/return classifier corrupted Simulation flamegraphs on AArch64 only (malloc -> driftsort_main, dealloc/sin parenting the workload, etc.) by inverting call edges and fabricating '2 recursion clones: 1. CLG_(unwind_call_stack): minpops is meant to bound only SP-equal pops, but was decremented for SP-lower pops too. A callee's still-open sub-call frames exhausted the budget before its own SP-equal entry frame was reached, leaving it stuck. SP-lower frames now pop unconditionally without consuming the budget. 2. setup_bbcc return classifier: when the returning function had made SP-lower sub-calls, popcount_on_return defaulted to 1 and left a PLT stub+callee pair or a tail-call chain (which all share the SP-equal block) stuck. It now skips the SP-lower frames and pops the SP-equal block down to the deepest frame matching the return target. Keyed on where the top frame sits, so the x86 paths (SP-lower-top return, SP-deeper jump) are preserved exactly; only the AArch64/PPC SP-equal case is changed. 3. reconstruct_call_stack_from_native: seeded ret_addr was ips[frame+1], i.e. return_PC - 1 (VG_(get_StackTrace) reports the call insn, not the return PC), but the classifier matches against the return PC. Normalize with +1 so returns across mid-run-seeded frames are not demoted to jumps. COD-2985. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
On AArch64/PPC the call instruction leaves SP unchanged (return address in the
link register), so a callee's shadow-stack entry frame sits at SP equal to its
return target rather than strictly below it as on x86. Three defects in
callgrind's call/return classifier corrupted Simulation flamegraphs on AArch64
only — inverting call edges (
malloc → core::slice::sort::stable::driftsort_main,__rdl_dealloc/sinparenting the workload) and fabricating'2recursionclones of non-recursive functions.
Fixes COD-2985.
The three defects (
callgrind/callstack.c,callgrind/bbcc.c)CLG_(unwind_call_stack)— minpops accounting.minpopsis meant to boundonly SP-equal pops, but was decremented for SP-lower pops too. A callee's
still-open sub-call frames exhausted the budget before its own SP-equal entry
frame was reached, leaving that frame stuck (keeping the callee's context
active → inverted edges; never decrementing its recursion depth →
'2clones).SP-lower frames now pop unconditionally without consuming the budget.
setup_bbccreturn classifier. When the returning function had madeSP-lower sub-calls,
popcount_on_returndefaulted to 1 and left a PLTstub+callee pair or a tail-call chain (which share the SP-equal block) stuck.
The classifier is now keyed on where the top frame sits: an SP-lower returning
frame (every x86 return) is handled exactly as before; an SP-equal returning
frame (AArch64/PPC) pops the SP-equal block down to the deepest frame whose
recorded return address is the return target — covering both a PLT stub +
callee pair (same return address) and a tail-call chain
a→b→c(c'sretlands past b and a). The downstream assertion is relaxed to
popcount_on_return >= 0, since a return may now legitimately vacate onlySP-lower frames.
reconstruct_call_stack_from_native— seededret_addroff-by-one.VG_(get_StackTrace)reports caller frames at the last byte of the callinstruction (
return_PC − 1), but the classifier matches against the returnPC. Normalized with
+1so returns across frames seeded at a mid-runCALLGRIND_START_INSTRUMENTATION(as CodSpeed always does) aren't demoted tojumps.
Only the SP-equal code paths are changed; those exist only where the call
instruction leaves SP unchanged (AArch64
bl/blr, PPCb(c)l), so x86 ispreserved exactly.
Validation
AArch64 (the bug, fractal e2e benchmark):
malloc/__rdl_deallocweremis-rooted blobs (55% / 47% of the benchmark) and are now thin leaves
(5.5% / 3.7%) containing only allocator internals;
driftsort_mainis back undergraph_benchmark::analyze_fractal_treeat 27% (matches the x86 reference's 28%);0 infra→workload inversions; non-recursive functions no longer get
'2clones.Rust, C++ and Python e2e flamegraphs all correct.
x86 (no regression):
on ubuntu-22.04 and ubuntu-24.04, identical to master.
classifier + unwind are bit-identical to the originals.
ubuntu-latestbuilt against this branch: all flamegraphscorrect.
🤖 Generated with Claude Code