Skip to content

Feature: Add unified tensor and args dump support#792

Open
vegetabledoww wants to merge 1 commit into
hw-native-sys:mainfrom
vegetabledoww:main
Open

Feature: Add unified tensor and args dump support#792
vegetabledoww wants to merge 1 commit into
hw-native-sys:mainfrom
vegetabledoww:main

Conversation

@vegetabledoww
Copy link
Copy Markdown

@vegetabledoww vegetabledoww commented May 18, 2026

Summary

This PR adds orchestrator-level argument and scalar value dump support to the tensor dump pipeline, introducing TensorDumpStage::KERNEL_ARGS_DUMP to capture per-dispatch kernel args metadata (dtype/shape/scalar/pack_mode) without copying full tensor payloads. --dump-tensor now enables metadata export (kernel_args_dump.json + tensor_dump.json) alongside existing tensor payload export (tensor_dump.bin). Supports A2A3 and A5 sim and onboard paths.

Design

  • Reuse existing tensor dump channel — shares the existing dump SHM, collector, ready/free queues, and host export pipeline.
  • New dump stage KERNEL_ARGS_DUMP — records per-dispatch kernel metadata (dtype/shape/scalar/pack_mode only, no payload copy). For scalar args, emits a record with pack_mode=SCALAR_VALUE. For tensor args in this stage, bypasses selective-mask filtering.
  • Scalar-safe JSON — scalar values emitted as strings to avoid precision loss or JSON number overflow.
  • Dual-platform coverage — supports A2A3 and A5 sim and onboard paths.
  • dispatch_id kept — retained in TensorDumpRecord and JSON for consumer (Insight Trace loader) compatibility.
  • No redundant helpersdump_kernel_args_for_dispatch and dump_orch_args_for_task removed; their logic lives in dump_tensors_for_task overloads.

Scope

~27 files across Python CLI/bindings, common config, A2A3 and A5 device/host/runtime layers.

Verification

PagedAttention SmallCase was used as the end-to-end validation case:

  • a2a3sim ✅
  • a5sim ✅
  • a2a3 onboard via task-submit ✅

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a feature to dump orchestrator-level arguments (tensors and scalars) into an args_dump.json file for the a2a3 and a5 platforms. Key changes include updating the CallConfig structure, Python bindings, and the TensorDumpCollector to handle a new ARGS_DUMP stage. Feedback identifies several critical issues: the collector currently records arena offsets instead of actual scalar values, and the JSON export is missing essential tensor metadata such as shapes and data pointers. Additionally, several leftover debug logs need to be removed, an indentation error in the a5 collector should be corrected, and the flush logic in the a2a3 executor should be updated to a loop for consistency across platforms.

e.ndims = rec.ndims;
memcpy(e.shapes.data(), rec.shapes, sizeof(e.shapes));
e.type = (rec.ndims == 0) ? "scalar" : "tensor";
e.scalar_value = (rec.ndims == 0 && rec.payload_size > 0 && rec.payload_offset > 0) ? rec.payload_offset : 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The scalar_value is being assigned the payload_offset, which is the offset into the circular arena where the value was written, not the actual 8-byte scalar value. The collector needs to read the uint64_t value from the arena using this offset (handling circular wrap if necessary) to correctly capture the argument value. Additionally, per repository rules, ensure this 64-bit unsigned integer is represented as a string in the JSON output to prevent precision loss in JavaScript-based parsers.

References
  1. When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss in JavaScript-based parsers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

e.ndims = rec.ndims;
memcpy(e.shapes.data(), rec.shapes, sizeof(e.shapes));
e.type = (rec.ndims == 0) ? "scalar" : "tensor";
e.scalar_value = (rec.ndims == 0 && rec.payload_size > 0) ? rec.payload_offset : 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Similar to the a2a3 implementation, scalar_value is being set to the arena offset (payload_offset) instead of the actual value stored in the arena. This will result in incorrect values in the exported JSON. Ensure the resulting 64-bit value is serialized as a string.

References
  1. When serializing 64-bit unsigned integers to JSON, represent them as strings to prevent precision loss.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +623 to +625
json << " {\"arg_index\": " << e.arg_index << ", \"type\": \"" << e.type << "\", \"dtype\": \"" << dtype_name << "\", \"ndims\": " << static_cast<uint32_t>(e.ndims);
json << ", \"scalar_value\": " << e.scalar_value;
json << "}";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The JSON export for orchestrator arguments is missing critical metadata for tensors. The PR description states that it should dump shapes and addresses, but the current implementation only exports arg_index, type, dtype, ndims, and scalar_value. For tensors, the shapes array and the data_ptr (address) should also be included in the JSON object. Note that the data_ptr (a 64-bit address) must be represented as a string to comply with repository standards.

References
  1. When serializing 64-bit unsigned integers (such as hardware addresses) to JSON, represent them as strings to prevent precision loss.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

#endif
{
bool en = is_dump_args_enabled();
LOG_ERROR("ARGS_DEBUG: Thread %d is_dump_args_enabled=%d args_addr=%p", thread_idx, en, (void*)orch_args_cached_);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This LOG_ERROR with the ARGS_DEBUG prefix appears to be leftover debug code and should be removed before merging.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Flush the args dump buffer so the host collector can pick it up.
{
bool en = is_dump_args_enabled();
LOG_ERROR("ARGS_FLUSH: Thread %d enabled=%d", thread_idx, en);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This LOG_ERROR with the ARGS_FLUSH prefix appears to be leftover debug code and should be removed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +762 to +765
int orch_thread = thread_num_ - 1;
if (orch_thread >= sched_thread_num_ && orch_thread < thread_num_) {
dump_tensor_flush(orch_thread);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic here only flushes the last thread (thread_num_ - 1), but the comment above (line 757) says it should flush all threads from sched_thread_num_ to thread_num_ - 1. If there are multiple orchestrator threads, some buffers will not be flushed. This should be a loop to maintain consistency with the implementation in other runtimes like a5.

        for (int t = sched_thread_num_; t < thread_num_; t++) {
            dump_tensor_flush(t);
        }
References
  1. For files shared across different runtimes, maintain identical code for consistency.

Comment on lines +478 to +482
LOG_ERROR("A5_ARGS_FLUSH2: thread=%d count=%u", thread_idx, buf->count);
uint64_t buf_addr = reinterpret_cast<uint64_t>(buf);
uint32_t seq = s_dump_states[thread_idx]->current_buf_seq;
if (enqueue_dump_ready_buffer(thread_idx, buf_addr, seq) != 0) {
int enq_rc = enqueue_dump_ready_buffer(thread_idx, buf_addr, seq);
LOG_ERROR("A5_ARGS_ENQ: thread=%d rc=%d seq=%u", thread_idx, enq_rc, seq);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These LOG_ERROR statements with A5_ARGS_FLUSH2 and A5_ARGS_ENQ prefixes are clearly debug logs and should be removed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return 0;
}
} else {
auto export_start = std::chrono::steady_clock::now();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The code inside the else block is not properly indented.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +1076 to +1079
json << " {\"arg_index\": " << e.arg_index << ", \"type\": \"" << e.type << "\", \"dtype\": \""
<< dtype_name << "\", \"ndims\": " << static_cast<uint32_t>(e.ndims);
json << ", \"scalar_value\": " << e.scalar_value;
json << "}";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The JSON export for a5 is also missing tensor shapes and addresses (data_ptr), which are required for the feature to be fully functional. Ensure data_ptr is serialized as a string per repository rules for 64-bit integers.

References
  1. When serializing 64-bit unsigned integers (such as hardware addresses) to JSON, represent them as strings to prevent precision loss.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@vegetabledoww
Copy link
Copy Markdown
Author

vegetabledoww commented May 21, 2026

Follow-up issue for the remaining Insight Trace gap: #837.

PR #792 adds orchestrator-level --dump-args / args_dump.json, while #837 tracks the missing per-kernel dispatch args dump needed for single-kernel replay.

@vegetabledoww vegetabledoww changed the title Add: enable_dump_args — orchestrator args dump Feature: Add unified tensor and args dump support May 25, 2026
@vegetabledoww vegetabledoww force-pushed the main branch 8 times, most recently from 4da037c to 362c571 Compare May 27, 2026 06:21
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR extends the tensor dump infrastructure to capture kernel-argument metadata (tensors, scalars, context pointers) at dispatch time, enabling per-dispatch visibility into kernel call signatures. It introduces a new KERNEL_ARGS_DUMP stage, extends dump records with dispatch/core/block identifiers, implements kernel-args recording and collection, and integrates instrumentation throughout the scheduler and executor across both a2a3 and a5 platforms.

Changes

Kernel Arguments Dump Feature

Layer / File(s) Summary
Schema and Dump Contracts
src/a2a3/platform/include/common/tensor_dump.h, src/a5/platform/include/common/tensor_dump.h
New dump stage KERNEL_ARGS_DUMP, and enums TensorDumpCoreType (AIC/AIV) and TensorDumpPackMode (tensor pointer, scalar value, context pointers). TensorDumpRecord and TensorDumpInfo extended with dispatch_id, core_id, block_idx, core_type, pack_mode while preserving 128-byte cache-line alignment.
Host Collection and Export Contracts
src/a2a3/platform/include/host/tensor_dump_collector.h, src/a5/platform/include/host/tensor_dump_collector.h
New KernelArgsDumpEntry struct capturing dispatch metadata and tensor shapes, kernel_args_entries_ member in TensorDumpCollector, and export_kernel_args_dump_file() method declaration.
Kernel Args Dispatch Recording
src/a2a3/platform/include/aicpu/tensor_dump_aicpu.h, src/a5/platform/include/aicpu/tensor_dump_aicpu.h
New dump_kernel_args_for_dispatch template function recording tensor/scalar arguments and context pointers with appropriate pack modes. Updated dump_tensors_for_task with separate arg_index and tensor_index counters to map signature positions to tensor payloads correctly.
Payload Handling and Metadata
src/a2a3/platform/src/aicpu/tensor_dump_aicpu.cpp, src/a5/platform/src/aicpu/tensor_dump_aicpu.cpp
dump_tensor_record branches on KERNEL_ARGS_DUMP stage to write raw uint64_t kernel-arg values instead of tensor data. Metadata records now populate dispatch_id, core_id, block_idx, core_type, pack_mode.
Host Collection and JSON Export
src/a2a3/platform/src/host/tensor_dump_collector.cpp, src/a5/platform/src/host/tensor_dump_collector.cpp
process_dump_buffer parses KERNEL_ARGS_DUMP records into KernelArgsDumpEntry entries with arena value extraction. export_kernel_args_dump_file writes kernel_args_dump.json, grouping by dispatch and sorting by arg index. Export early-exit and state clearing updated for kernel-args data.
Scheduler Dispatch Instrumentation
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp, src/a5/runtime/.../*
New kernel_args_dispatch_seq_ atomic counter in scheduler context, reset during deinit. dispatch_subtask_to_core conditionally records kernel args (when PTO2_PROFILING and dumping enabled), allocating unique dispatch IDs and selecting core type.
Runtime Executor Integration
src/a2a3/platform/onboard/aicpu/kernel.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp, src/a2a3/platform/onboard/host/device_runner.cpp, src/a2a3/platform/sim/host/device_runner.cpp, src/a5/platform/.../*
Refactored dump-enabled flag computation in kernel servers. Orchestrator thread flushes tensor dumps after dispatch. Executor deinit extends worker-thread flush and resets execution state fields. Simulation mode adds patchelf rpath patching and improved dlsym error handling.
Documentation and Testing
docs/dfx/tensor-dump.md, docs/testing.md, simpler_setup/tools/dump_viewer.py, tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py
Tensor-dump docs updated to reflect strides/start_offset geometry; kernel_args_dump.json artifact documented. Profiling docs expanded to describe per-dispatch kernel args dumping. dump_viewer fixed to use fixed tensor_dump.json path. Tests validate kernel_args_dump.json presence and structure.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 A rabbit hops through dispatch lands, now tracing every call with care,
Kernel arguments bright recorded, per-dispatch metadata laid bare,
From tensor shapes to scalar values, each context pointer caught in flight,
The dump collector gathers all, and exports JSON to the light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately summarizes the main feature addition: unified tensor and args dump support, which is a clear and specific description of the primary change.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description clearly describes the changeset, covering the new KERNEL_ARGS_DUMP stage, scalar value dumping, JSON export, dual-platform support, and the scope of affected files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/dfx/tensor-dump.md`:
- Around line 38-40: The docs file contains duplicated CLI example lines (e.g.,
"python tests/st/<case>/test_<name>.py -p a5sim --dump-tensor" and "pytest
tests/st/<case> --platform a5sim --dump-tensor") in the code blocks around the
shown diffs and again at lines 47-55; remove the repeated identical lines so
each example appears only once, keeping one instance of each unique command
(e.g., keep a single "python tests/st/<case>/test_<name>.py -p a5sim
--dump-tensor" and a single "pytest tests/st/<case> --platform a5sim
--dump-tensor") and ensure the surrounding code block fences remain valid.

In `@docs/testing.md`:
- Line 113: Update the `--dump-tensor` option row in the table so its
description matches the updated behavior: mention that it dumps tensors and
per-dispatch kernel arguments during runtime execution (e.g., change the
description for `--dump-tensor` to "Dump tensors and per-dispatch kernel args
during runtime execution"). Locate the table entry referencing `--dump-tensor`
and replace the old single-word "Dump tensors" text with the expanded
description to align with nearby sections.
- Line 328: The artifact list item for outputs/<case>_<ts>/tensor_dump/ repeats
the same flag token; remove the duplicate so the parenthetical shows the flag
only once (e.g., change "(--dump-tensor, --dump-tensor)" to "(`--dump-tensor`)"
or to the correct pair if a different second flag was intended), updating the
text that contains the token `--dump-tensor` to a single, accurate mention.

In `@src/a2a3/platform/onboard/aicpu/kernel.cpp`:
- Around line 116-117: The device-side tensor-dump gate is only driven by
PROFILING_FLAG_DUMP_TENSOR via GET_PROFILING_FLAG and set_dump_tensor_enabled in
kernel.cpp, so CLI flag --dump-args (dump_args) never enables the dispatch
capture path; update the plumbing so dump_args toggles the same gate: either map
dump_args to set PROFILING_FLAG_DUMP_TENSOR at runtime or extend the gate to
check the dump_args state (e.g., read dump_args and call set_dump_tensor_enabled
or augment is_dump_tensor_enabled to OR in dump_args). Locate and modify the
call sites around set_dump_tensor_enabled, GET_PROFILING_FLAG,
is_dump_tensor_enabled, and the runtime flag handling for enable_dump_tensor_ /
dump_args to ensure --dump-args independently enables the dump gate.

In `@src/a2a3/platform/sim/host/device_runner.cpp`:
- Around line 112-117: create_temp_so_file() currently hard-fails (unlinks and
returns false) if the system call "patchelf --set-rpath
'/data/software/gcc-15/lib64' <path_buf>" fails, which makes startup
environment-dependent and prevents any subsequent dlopen attempts in
ensure_binaries_loaded(); change this to a best-effort behavior: remove the
fixed constant rpath usage (do not assume /data/software/gcc-15/lib64), attempt
patchelf only if an environment-derived runtime path is available, and if the
std::system call fails, log a non-fatal warning and continue (do not unlink
path_buf or return false) so callers like ensure_binaries_loaded() can proceed
to try dlopen; use the existing identifiers create_temp_so_file(), path_buf,
gcc_runtime and ensure_binaries_loaded() to locate and update the logic.

In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 186-189: Clamp rec.ndims to PLATFORM_DUMP_MAX_DIMS before storing
and copying shapes: set entry.ndims = std::min(rec.ndims,
PLATFORM_DUMP_MAX_DIMS) (or equivalent) and then copy shapes only for d <
entry.ndims into entry.shapes; apply the same clamp-and-copy fix to the other
occurrence referenced (lines ~533-536) so serialization cannot read past
entry.shapes.
- Line 719: The vector kernel_args_entries_ is only cleared on normal export and
can leak stale entries when the finalize/reset path is taken; update the
finalize and reset code paths (e.g. the finalize() and/or reset() blocks
surrounding the existing kernel_args_entries_.clear() call) to also call
kernel_args_entries_.clear() (and any associated cleanup like freeing entries if
necessary) so that kernel_args_entries_ is emptied regardless of successful
export.

In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 253-256: Clamp rec.ndims to PLATFORM_DUMP_MAX_DIMS before
assigning to entry.ndims and before copying into entry.shapes to prevent
writing/reading past the fixed-size buffer; specifically, replace the direct
assignment entry.ndims = rec.ndims with something like entry.ndims =
std::min(rec.ndims, PLATFORM_DUMP_MAX_DIMS) and use that clamped value in the
loop that copies rec.shapes into entry.shapes (the same change should be applied
to the other occurrence handling ndims/shapes around the rec use at the second
block mentioned).
- Line 763: In the TensorDumpCollector reset path, ensure kernel_args_entries_
is cleared alongside collected_: inside the same finalize/reset method where
collected_ is cleared (e.g., TensorDumpCollector::FinalizeReset or the
reset/Finalize method that currently calls collected_.clear()), add
kernel_args_entries_.clear(); so kernel argument entries cannot persist between
runs; keep placement and locking consistent with the existing collected_
clearing to preserve thread-safety.

In `@src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 710-714: The flush loop starts at sched_thread_num_ and so omits
scheduler threads (they use thread_idx < sched_thread_num_); change the loop in
the is_dump_tensor_enabled() block to iterate over all thread indices that may
emit dumps (e.g., start at 0 and loop to aicpu_thread_num_) so
dump_tensor_flush(t) is called for scheduler-thread indices as well; update the
loop that currently uses sched_thread_num_ as the start to cover the full range
of 0..aicpu_thread_num_-1.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 319fdfc4-6541-418c-ae9e-6c9048793732

📥 Commits

Reviewing files that changed from the base of the PR and between 49d74e8 and c85c505.

📒 Files selected for processing (27)
  • docs/dfx/tensor-dump.md
  • docs/testing.md
  • simpler_setup/tools/dump_viewer.py
  • src/a2a3/platform/include/aicpu/tensor_dump_aicpu.h
  • src/a2a3/platform/include/common/tensor_dump.h
  • src/a2a3/platform/include/host/tensor_dump_collector.h
  • src/a2a3/platform/onboard/aicpu/kernel.cpp
  • src/a2a3/platform/onboard/host/device_runner.cpp
  • src/a2a3/platform/sim/host/device_runner.cpp
  • src/a2a3/platform/src/aicpu/tensor_dump_aicpu.cpp
  • src/a2a3/platform/src/host/tensor_dump_collector.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/platform/include/aicpu/tensor_dump_aicpu.h
  • src/a5/platform/include/common/tensor_dump.h
  • src/a5/platform/include/host/tensor_dump_collector.h
  • src/a5/platform/onboard/aicpu/kernel.cpp
  • src/a5/platform/sim/host/device_runner.cpp
  • src/a5/platform/src/aicpu/tensor_dump_aicpu.cpp
  • src/a5/platform/src/host/tensor_dump_collector.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • tests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py

Comment thread docs/dfx/tensor-dump.md
Comment thread docs/testing.md Outdated
Comment thread docs/testing.md Outdated
Comment on lines +116 to +117
bool dump_enabled = GET_PROFILING_FLAG(k_args->enable_profiling_flag, PROFILING_FLAG_DUMP_TENSOR);
set_dump_tensor_enabled(dump_enabled);
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate dump-related profiling flags and runtime gates.
rg -n -C2 'PROFILING_FLAG_DUMP_(TENSOR|ARGS)|set_dump_tensor_enabled|is_dump_tensor_enabled|dump_args|--dump-args'

# 2) Trace where enable_profiling_flag is assembled and propagated.
rg -n -C3 '\benable_profiling_flag\b|dump_args|dump_tensor'

# 3) Verify CLI/config surfaces for independent args-dump control.
rg -n -C3 '--dump-args|dump_args'

Repository: hw-native-sys/simpler

Length of output: 50820


Fix/clarify --dump-args plumbing: this device-side dump gate is driven only by PROFILING_FLAG_DUMP_TENSOR.

  • src/a2a3/platform/onboard/aicpu/kernel.cpp sets set_dump_tensor_enabled(GET_PROFILING_FLAG(..., PROFILING_FLAG_DUMP_TENSOR)), and the runtime dump points are guarded by is_dump_tensor_enabled() in the scheduler.
  • The only connected enable path for this gate is enable_dump_tensor_PROFILING_FLAG_DUMP_TENSORset_dump_tensor_enabled(...) (including the AICPU-side set_dump_tensor_enabled implementation).
  • No --dump-args / dump_args symbols were found, so --dump-args does not appear to independently toggle this capture gate; args-dump-only runs may therefore not enable the dispatch capture path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/onboard/aicpu/kernel.cpp` around lines 116 - 117, The
device-side tensor-dump gate is only driven by PROFILING_FLAG_DUMP_TENSOR via
GET_PROFILING_FLAG and set_dump_tensor_enabled in kernel.cpp, so CLI flag
--dump-args (dump_args) never enables the dispatch capture path; update the
plumbing so dump_args toggles the same gate: either map dump_args to set
PROFILING_FLAG_DUMP_TENSOR at runtime or extend the gate to check the dump_args
state (e.g., read dump_args and call set_dump_tensor_enabled or augment
is_dump_tensor_enabled to OR in dump_args). Locate and modify the call sites
around set_dump_tensor_enabled, GET_PROFILING_FLAG, is_dump_tensor_enabled, and
the runtime flag handling for enable_dump_tensor_ / dump_args to ensure
--dump-args independently enables the dump gate.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specific concern is outdated under the current design. We removed the separate --dump-args control path and unified per-dispatch args capture under --dump-tensor, so there is no longer an independent args-only gate to plumb through kernel.cpp.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Resolve the outdated review comment (id from the comment URL: r3309124905)
gh api \
  --method PUT \
  -H "Accept: application/vnd.github+json" \
  "repos/hw-native-sys/simpler/pulls/792/reviews" 2>/dev/null || true

# Resolve the specific review comment thread
gh api \
  --method DELETE \
  -H "Accept: application/vnd.github+json" \
  "repos/hw-native-sys/simpler/pulls/comments/3309124905" 2>/dev/null || true

# List my comments to confirm the right one
gh api \
  -H "Accept: application/vnd.github+json" \
  "repos/hw-native-sys/simpler/pulls/792/comments" \
  --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[:80]}'

Length of output: 2267


🏁 Script executed:

#!/bin/bash
# Get the thread node ID for the comment with database id 3309124905
gh api graphql -f query='
{
  repository(owner: "hw-native-sys", name: "simpler") {
    pullRequest(number: 792) {
      reviewThreads(first: 100) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              databaseId
              body
            }
          }
        }
      }
    }
  }
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3309124905) | {threadId: .id, isResolved: .isResolved}'

Length of output: 687


🏁 Script executed:

#!/bin/bash
gh api graphql -f query='
mutation {
  resolveReviewThread(input: {threadId: "PRRT_kwDORCBKLM6FBh-0"}) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 677



@vegetabledoww, I wasn't able to resolve the thread programmatically (the integration lacks the required permissions), but I've noted the design decision for future reviews. Feel free to resolve the thread manually on your end. Thanks for the context!


✏️ Learnings added
Learnt from: vegetabledoww
Repo: hw-native-sys/simpler PR: 792
File: src/a2a3/platform/onboard/aicpu/kernel.cpp:116-117
Timestamp: 2026-05-27T09:11:48.030Z
Learning: In `hw-native-sys/simpler` PR `#792`, the separate `--dump-args` CLI flag/control path was removed. Per-dispatch kernel args capture is unified under `--dump-tensor`: enabling `--dump-tensor` (via `PROFILING_FLAG_DUMP_TENSOR` → `set_dump_tensor_enabled`) is the single gate for both tensor payload dumps and `kernel_args_dump.json` production. There is intentionally no independent `--dump-args` gate in `kernel.cpp` or elsewhere.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +112 to +117
const char *gcc_runtime = "/data/software/gcc-15/lib64";
int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
if (rc != 0) {
unlink(path_buf.data());
return false;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify hard dependency points in source
rg -n "patchelf --set-rpath|/data/software/gcc-15/lib64" --type=cpp

# Verify host/toolchain assumptions in current environment
if command -v patchelf >/dev/null 2>&1; then
  patchelf --version
else
  echo "patchelf not found in PATH"
fi

if [ -d /data/software/gcc-15/lib64 ]; then
  ls -ld /data/software/gcc-15/lib64
else
  echo "/data/software/gcc-15/lib64 does not exist"
fi

Repository: hw-native-sys/simpler

Length of output: 415


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/a2a3/platform/sim/host/device_runner.cpp"

# Find the function(s) that contain this snippet and where create_temp_so_file is defined/used
rg -n "create_temp_so_file" "$FILE"
rg -n "unlink\(path_buf\.data\(\)\)|patchelf --set-rpath|gcc_runtime" "$FILE"

# Show the surrounding code for the rpath block (lines ~90-140) and the full create_temp_so_file function
sed -n '90,150p' "$FILE"

# Extract create_temp_so_file body by printing from its signature until the next matching brace depth is too hard;
# instead, print a broader window around the first occurrence.
# This keeps output small while still giving context for return handling.
line=$(rg -n "create_temp_so_file" "$FILE" | head -n1 | cut -d: -f1)
start=$((line-60))
end=$((line+140))
sed -n "${start},${end}p" "$FILE" | sed -n '1,240p'

Repository: hw-native-sys/simpler

Length of output: 9981


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"

rg -n "LOG_WARN|LOG_INFO|LOG_DEBUG|LOG_ERROR" "$FILE" --type=cpp

Repository: hw-native-sys/simpler

Length of output: 5541


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"

sed -n '190,245p' "$FILE"
sed -n '295,345p' "$FILE"
sed -n '1080,1145p' "$FILE"

Repository: hw-native-sys/simpler

Length of output: 7519


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "LOG_WARN(_|\\b)|LOG_WARNING(_|\\b)|WARN\\(" -S . --glob '!**/node_modules/**' --glob '!**/dist/**' --glob '!**/build/**' --glob '!**/.git/**' --glob '!**/vendor/**'

Repository: hw-native-sys/simpler

Length of output: 11638


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"

# Show the include section and any logging headers used
sed -n '1,120p' "$FILE"

Repository: hw-native-sys/simpler

Length of output: 4682


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.h"
ls -l "$FILE" || true
sed -n '1,200p' "$FILE" 2>/dev/null || true
rg -n "LOG_WARN|unified_log|LOG_ERROR|LOG_DEBUG" "$FILE" --type=h

Repository: hw-native-sys/simpler

Length of output: 7852


Hard-failing patchelf + fixed /data/software/gcc-15/lib64 makes sim startup environment-dependent (blocks dlopen)

create_temp_so_file() unlinks the temp .so and returns false when std::system("patchelf --set-rpath ...") fails (lines ~112-117), and all callers treat that as fatal—so ensure_binaries_loaded() aborts before any dlopen() attempt.

💡 Suggested fix (best-effort patching instead of hard fail)
-    const char *gcc_runtime = "/data/software/gcc-15/lib64";
-    int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
-    if (rc != 0) {
-        unlink(path_buf.data());
-        return false;
-    }
+    const char *gcc_runtime = "/data/software/gcc-15/lib64";
+    int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
+    if (rc != 0) {
+        LOG_WARN("patchelf --set-rpath failed (rc=%d), continuing without rpath patch", rc);
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const char *gcc_runtime = "/data/software/gcc-15/lib64";
int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
if (rc != 0) {
unlink(path_buf.data());
return false;
}
const char *gcc_runtime = "/data/software/gcc-15/lib64";
int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
if (rc != 0) {
LOG_WARN("patchelf --set-rpath failed (rc=%d), continuing without rpath patch", rc);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 112 - 117,
create_temp_so_file() currently hard-fails (unlinks and returns false) if the
system call "patchelf --set-rpath '/data/software/gcc-15/lib64' <path_buf>"
fails, which makes startup environment-dependent and prevents any subsequent
dlopen attempts in ensure_binaries_loaded(); change this to a best-effort
behavior: remove the fixed constant rpath usage (do not assume
/data/software/gcc-15/lib64), attempt patchelf only if an environment-derived
runtime path is available, and if the std::system call fails, log a non-fatal
warning and continue (do not unlink path_buf or return false) so callers like
ensure_binaries_loaded() can proceed to try dlopen; use the existing identifiers
create_temp_so_file(), path_buf, gcc_runtime and ensure_binaries_loaded() to
locate and update the logic.

Comment thread src/a2a3/platform/src/host/tensor_dump_collector.cpp Outdated
Comment thread src/a2a3/platform/src/host/tensor_dump_collector.cpp
Comment thread src/a5/platform/src/host/tensor_dump_collector.cpp Outdated
Comment thread src/a5/platform/src/host/tensor_dump_collector.cpp
Comment thread src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
@vegetabledoww vegetabledoww force-pushed the main branch 3 times, most recently from bd7fd6d to ec69d20 Compare May 28, 2026 01:49
Restore per-dispatch kernel args dump generation, unify dump capture under --dump-tensor, and shrink the producer/export path so the resulting kernel_args_dump.json stays compatible with current PR821 consumption while avoiding redundant producer metadata.
Copy link
Copy Markdown
Collaborator

@ChaoZheng109 ChaoZheng109 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

逐点见行内评论。总方向:删掉独立 kernel_args_dump.json 链路,把 scalar 折叠进现有 tensor_dump.json(BEFORE_DISPATCH 不再跳过 scalar + 给每条 entry 加 kind 字段);格式保持「信息全」,由 PR #821 改 parser 来适配我们,而不是把我们的格式改复杂去迁就它。另外 patchelf / 缩进 / no-op 三处脏改动,无论方案如何都要删。

return false;
}

const char *gcc_runtime = "/data/software/gcc-15/lib64";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

硬编码本机绝对路径 /data/software/gcc-15/lib64 + std::system(patchelf ...),换台服务器 / CI 直接失效,而且和本特性毫无关系——这是调试残留,请删除整段。顺带这处还删掉了上面 fchmod 那行有用的注释,一并恢复。

if (aicpu_execute_func_ == nullptr || aicore_execute_func_ == nullptr || set_platform_regs_func_ == nullptr ||
set_platform_dump_base_func_ == nullptr || set_dump_tensor_enabled_func_ == nullptr ||
set_platform_pmu_base_func_ == nullptr || set_pmu_enabled_func_ == nullptr) {
set_platform_pmu_base_func_ == nullptr ||
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format 失手:这行被顶到了第 0 列,缩进错乱。clang-format -i 修一下即可。

set_platform_regs(k_args->regs);
set_platform_dump_base(k_args->dump_data_base);
set_dump_tensor_enabled(GET_PROFILING_FLAG(k_args->enable_profiling_flag, PROFILING_FLAG_DUMP_TENSOR));
bool dump_enabled = GET_PROFILING_FLAG(k_args->enable_profiling_flag, PROFILING_FLAG_DUMP_TENSOR);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no-op 改动:只是把原来一行 set_dump_tensor_enabled(GET_PROFILING_FLAG(...)) 拆成先存 bool dump_enabled 再传入,行为完全等价。请还原(a2a3 同处一致)。

for (int32_t sig_idx = 0; sig_idx < callable.sig_count(); sig_idx++) {
ArgDirection dir = callable.sig(sig_idx);
if (dir == ArgDirection::SCALAR) {
if (stage == TensorDumpStage::KERNEL_ARGS_DUMP) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

关键点:现在是新开 KERNEL_ARGS_DUMP 分支来采 scalar。建议改为——在现有 BEFORE_DISPATCH 流程里别再 continue 跳过 SCALAR 槽,直接产一条 record(kind=scalarvalue=pl.scalars[scalar_index]、不进 .bin)。这样无需新 stage / 新调用点 / dispatch_id,产物统一进 tensor_dump.json。scalar 本来也有 before/after,它只是 dump 内容的扩展,不是新的产物。

enum class TensorDumpStage : uint8_t {
BEFORE_DISPATCH = 0,
AFTER_COMPLETION = 1,
KERNEL_ARGS_DUMP = 2,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TensorDumpStage 语义是「时间点」(before / after),而 KERNEL_ARGS_DUMP 是「模式 / 产物」,把它塞进 stage 枚举混淆了两个正交概念。建议删除这个 stage,scalar 改在现有 BEFORE_DISPATCH 一并采集。

std::vector<uint8_t> bytes;
};

struct KernelArgsDumpEntry {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KernelArgsDumpEntry 与已有的 DumpedTensor 字段大量重叠,属于结构体增殖。折叠方案下可直接复用 DumpedTensor / record,不必新增。

return (dt.ndims == 0) ? 1 : numel;
}

int TensorDumpCollector::export_kernel_args_dump_file() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议删除整个独立 kernel_args_dump.json 导出。对照唯一消费者 PR #821load_kernel_dump_args,它真正只读 per-arg 的 arg_index/kind/dtype/shape(tensor)和 value/pack_mode(scalar),这些和现有 tensor_dump.json 几乎完全重叠;ptr_value.bin、strides 它都不读。把 scalar 折叠进 tensor_dump.json 即可,不必再产一个文件。

first_arg = false;
json << "{\"arg_index\": " << entry->arg_index;
json << ", \"kind\": \"" << kernel_arg_kind_name(entry->pack_mode) << "\"";
json << ", \"pack_mode\": \"" << tensor_dump_pack_mode_name(entry->pack_mode) << "\"";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind(tensor/scalar)和 pack_mode(tensor_ptr/bits)由同一个字段派生,这里同时输出两个等价信息。建议只保留 kind;scalar 默认按 bits 解释即可。

(*p_func)(*orch_args_cached_);
rt_scope_end(rt);
if (is_dump_tensor_enabled()) {
dump_tensor_flush(thread_idx);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

新增的 flush(这里每次 orch run 之后,以及 deinit 里循环)目的不清楚。scalar record 进的是同一套 per-thread arena,scheduler 循环末尾已有的 dump_tensor_flush 应已覆盖。除非能给出「不加就丢 record」的先失败 repro,否则建议还原。

Comment thread docs/dfx/tensor-dump.md
format on `a2a3` and `a5`. Both runtimes are wired through.

Enable in one line:
`strides`, `start_offset`, and `is_contiguous` so logical views are
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raw_shape→strides 这处文档同步本身是对的(代码早就输出 strides),但它和本 PR 的 args dump 主题无关,属于夹带改动。建议在 PR 描述里注明是顺带的 doc-sync,或单独拆一个 commit。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants