Fix VisualWakeWords on GAP9 #187
Conversation
`PULP2DDWConvTemplate` inherited `computeTransientBuffersSize` from the regular-conv template, which sizes the scratch as `2 * 8 * ch_in * KH * KW` — independent of `H_in`. The upstream `pulp_nn_depthwise_*` kernel actually walks a column-shaped im2col of `dim_kernel_x * (dim_in_y + pad_top + pad_bot) + dim_kernel_x` bytes per core, so for any input taller than the inherited bound the kernel writes past its scratch into the next L1 tensor. With 8 channels and a 3×3 kernel the two formulas coincide at H=45 and diverge from H=46 onward (e.g. VWW PASS_1's 1×8×48×48 DW reproduces the corruption with ~74% of outputs wrong). Override `computeTransientBuffersSize` in `PULP2DDWConvTemplate` to use the kernel's actual per-core formula × NUM_CORES. Add a `Kernels/Integer/Conv/DW_2D_RQ_8x16x16` repro test exercising the previously-overflowing geometry, and wire VWW + the new kernel into the GAP9 L2 single- and double-buffer tiling rosters so CI catches any future regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
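A hedged sketch of what that override can look like; the `operatorRepresentation` keys, the buffer naming, and the fixed core count are assumptions, not the repo's exact code:

```python
# Sketch of the depthwise override; the key names (dim_kernel_x, dim_im_in_y,
# padding_y_top/bottom, nodeName) and NUM_CORES = 8 are assumptions.
class PULP2DDWConvTemplate(PULP2DConvTemplate):

    def computeTransientBuffersSize(self, ctxt, operatorRepresentation):
        kx = operatorRepresentation['dim_kernel_x']
        h_in = operatorRepresentation['dim_im_in_y']
        pad = operatorRepresentation['padding_y_top'] + operatorRepresentation['padding_y_bottom']
        # Per-core column-shaped im2col actually walked by pulp_nn_depthwise_*:
        #   dim_kernel_x * (dim_in_y + pad_top + pad_bot) + dim_kernel_x
        perCore = kx * (h_in + pad) + kx
        im2colName = operatorRepresentation['nodeName'] + "_buffer"
        return [(im2colName, 8 * perCore)]  # × NUM_CORES (8 on the PULP cluster)
```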
Bump pulp-nn-mixed to 1d8eeee0 (PR pulp-platform#11), which fixes the `pulp_nn_linear_*` family: when `num_o_neurons` is not a multiple of 2*NUM_CORES, the trailing odd neuron's `if (lft_neurons …)` branch called `pulp_nn_bn_quant_*(sum, *pKappa, *pLambda, …)` instead of the per-core advanced `*k1, *lambda1` pointers, so every core wrote element 0's RQS parameters into its own neuron. Any layer with an odd channel count (e.g. VWW's 2-class classifier) ended up with the wrong kappa/lambda. Rebuild the prebuilt GAP9 library against the fixed source. Split the dropped Clang-only `-Wno-incompatible-pointer-types-discards-qualifiers` on PULPOpen into the two flags GCC accepts (`-Wno-incompatible-pointer-types`, `-Wno-discards-qualifiers`) and add the same `-Wno-incompatible-pointer-types` plus `-Wno-implicit-function-declaration` to the GAP9 target so `pulp-nn-mixed` builds cleanly from source under the GCC GAP9 toolchain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Converted several transient-buffer sizing helpers from static to instance methods, added depthwise-specific per-core im2col sizing in the PULP DW conv templates, updated tiled test configs to include a DW kernel and VisualWakeWords model entries, tweaked compiler warning flags, and advanced the pulp-nn-mixed submodule pointer.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/PULPOpen/Templates/ConvTemplate.py`:
- Around lines 65-80: the hoisting path still calls `PULP2DConvTemplate.computeTransientBuffersSize` directly, so the depthwise override in `PULP2DDWConvTemplate` is never used. Update `PULP2DConvTemplate.hoistTransientBuffers` to call `computeTransientBuffersSize` dynamically (e.g., `self.computeTransientBuffersSize` or `type(obj).computeTransientBuffersSize`) and pass each returned tuple into `ctxt.hoistTransientBuffer`, so that the `PULP2DDWConvTemplate` override's results are actually hoisted.
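A minimal sketch of the suggested fix; the method body and signature are simplified stand-ins for the real Deeploy template, not its exact code:

```python
# Simplified sketch: the (ctxt, operatorRepresentation) signature, the
# NodeTemplate base class, and the (name, size) return tuples are assumptions.
class PULP2DConvTemplate(NodeTemplate):

    def hoistTransientBuffers(self, ctxt, operatorRepresentation):
        # Before: PULP2DConvTemplate.computeTransientBuffersSize(ctxt, ...)
        # pinned the lookup to the base class and bypassed subclass overrides.
        # After: dispatching through self reaches PULP2DDWConvTemplate's
        # depthwise-specific sizing when the template is the DW variant.
        transientBuffers = self.computeTransientBuffersSize(ctxt, operatorRepresentation)
        for name, size in transientBuffers:
            ctxt.hoistTransientBuffer(name, size)
        return ctxt, operatorRepresentation, [name for name, _ in transientBuffers]
```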
📒 Files selected for processing (11)
- Deeploy/Targets/PULPOpen/Templates/ConvTemplate.py
- DeeployTest/Tests/Kernels/Integer/Conv/DW_2D_RQ_8x16x16/activations.npz
- DeeployTest/Tests/Kernels/Integer/Conv/DW_2D_RQ_8x16x16/inputs.npz
- DeeployTest/Tests/Kernels/Integer/Conv/DW_2D_RQ_8x16x16/network.onnx
- DeeployTest/Tests/Kernels/Integer/Conv/DW_2D_RQ_8x16x16/outputs.npz
- DeeployTest/test_gap9_tiled_config.py
- DeeployTest/test_siracusa_tiled_config.py
- TargetLibraries/GAP9/CMakeLists.txt
- TargetLibraries/GAP9/prebuilt/libpulp-nn-mixed.a
- TargetLibraries/PULPOpen/CMakeLists.txt
- TargetLibraries/third_party/pulp-nn-mixed
`hoistTransientBuffers` called `computeTransientBuffersSize` via the hard-coded class name (e.g. `PULP2DConvTemplate.computeTransient...`), which bypassed any subclass override regardless of the actual `self` type. As a result the `PULP2DDWConvTemplate` override added in 65eda17 was dead code: VWW codegen kept using the regular-conv formula instead of the depthwise per-core size. Convert `computeTransientBuffersSize` from `@staticmethod` to an instance method on each affected template (PULP 1D/2D conv, PULP 2D float conv + DW, PULP iSoftmax, generic iSoftmax pre-allocated) and call it via `self.computeTransientBuffersSize(...)` so dispatch goes through the actual class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`PULP1DDWConvTemplate` previously inherited `computeTransientBuffersSize` from the regular 1D conv template, which sizes scratch as `8 * 2 * ch_in * KY` — independent of `H_in`. The upstream `pulp_nn_depthwise_*` kernel walks a column-shaped im2col of `dim_kernel_y * (dim_in_y + pad_top + pad_bot) + dim_kernel_y` bytes per core, so for tall enough inputs the kernel overflows the inherited buffer. Add the depthwise per-core formula × NUM_CORES, mirroring the 2D fix in 65eda17. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NITPICK: Update the link #11 to point at the pulp-nn-mixed repo PR.
Victor-Jung
left a comment
Changes look good to me. Thanks for the fixes! Why is the CI failing? Hasn't the Docker been updated with the new pulp-nn-mixed version? As soon as it passes, we can merge.
Summary
Two independent correctness bugs surfaced while bringing up the MLPerf VisualWakeWords (VWW) model on GAP9 with tiling + double buffering. Each is fixable on its own, and either alone is enough to corrupt VWW outputs. Both are fixed here, and VWW is wired into CI on GAP9 so we don't regress.
1. DWConv im2col scratch buffer is undersized on PULPOpen
`PULP2DDWConvTemplate` was inheriting `computeTransientBuffersSize` from the regular-conv template, which sizes the scratch as `2 * 8 * ch_in * KH * KW` — independent of `H_in`. The upstream `pulp_nn_depthwise_*` kernel actually walks a column-shaped im2col of `dim_kernel_x * (dim_in_y + pad_top + pad_bot) + dim_kernel_x` bytes per core, so for any input taller than the inherited bound the kernel writes past its scratch into the next L1 tensor. With 8 channels and a 3×3 kernel the two formulas coincide at `H=45` and diverge from `H=46` onward — VWW PASS_1's `1×8×48×48` DW reproduces the corruption with ~74% of outputs wrong.

Fix: override `computeTransientBuffersSize` in `PULP2DDWConvTemplate` to use the kernel's actual per-core formula × `NUM_CORES`. Also adds a `Kernels/Integer/Conv/DW_2D_RQ_8x16x16` repro test exercising the previously-overflowing geometry, so CI catches it.
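The crossover claim is easy to check numerically, assuming `pad_top = pad_bot = 1` and `NUM_CORES = 8` (both implied by the geometry above but not spelled out):

```python
# Both sizes in bytes; the padding and core count are assumptions.
ch_in, KH, KW, NUM_CORES = 8, 3, 3, 8
inherited = 2 * 8 * ch_in * KH * KW               # 1152, independent of H_in

def dw_scratch(H, pad_top=1, pad_bot=1):
    return NUM_CORES * (KW * (H + pad_top + pad_bot) + KW)

assert dw_scratch(45) == 1152 == inherited        # formulas coincide at H=45
assert dw_scratch(46) == 1176 > inherited         # from H=46 the kernel overflows
```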
2. `pulp_nn_linear_*` leftover-neuron RQS parameter aliasing

When `num_o_neurons` is not a multiple of `2*NUM_CORES`, the trailing odd-neuron `if (lft_neurons …)` branch in the `pulp_nn_linear_*` family called `pulp_nn_bn_quant_*(sum, *pKappa, *pLambda, …)` instead of the per-core advanced `*k1, *lambda1` pointers. Every core ended up writing element-0's RQS parameters into its own neuron, so any layer with an odd channel count (e.g. VWW's 2-class classifier) was silently quantized with the wrong kappa/lambda.

Fix lives in `pulp-nn-mixed` (PR #11 — `1d8eeee0`); this PR bumps the submodule and rebuilds the prebuilt GAP9 library against the fixed source.
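A small Python model of the aliasing (illustrative only; the real kernel is C in `pulp-nn-mixed`, and the chunking here is simplified):

```python
# Models one core's neuron chunk; requantize() stands in for pulp_nn_bn_quant_*.
def requantize(acc, kappa, lamb):
    return (acc * kappa + lamb) >> 16

def core_chunk(acc, kappa, lamb, start, stop, buggy):
    out = {}
    k1 = start                              # per-core advanced pointer (*k1/*lambda1)
    for n in range(start, stop - 1, 2):     # main loop handles 2 neurons per step
        out[n] = requantize(acc[n], kappa[k1], lamb[k1])
        out[n + 1] = requantize(acc[n + 1], kappa[k1 + 1], lamb[k1 + 1])
        k1 += 2
    if (stop - start) % 2:                  # lft_neurons: the trailing odd neuron
        n = stop - 1
        if buggy:
            # Bug: *pKappa/*pLambda dereference the base pointers, i.e.
            # element 0's RQS parameters, regardless of which neuron this is.
            out[n] = requantize(acc[n], kappa[0], lamb[0])
        else:
            # Fix: use the per-core advanced pointers.
            out[n] = requantize(acc[n], kappa[k1], lamb[k1])
    return out
```

Split one neuron per core, the core that owns neuron 1 requantizes it with neuron 0's parameters, which is exactly the VWW 2-class classifier symptom described above.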
While in there: split the dropped Clang-only `-Wno-incompatible-pointer-types-discards-qualifiers` on PULPOpen into the two flags GCC accepts (`-Wno-incompatible-pointer-types`, `-Wno-discards-qualifiers`), and add the same `-Wno-incompatible-pointer-types` plus `-Wno-implicit-function-declaration` to GAP9 so `pulp-nn-mixed` builds cleanly from source under the GCC GAP9 toolchain.

PR Merge Checklist
- [ ] Rebased on the latest `devel` commit and pointing to `devel`.
- [ ] The `CHANGELOG.md` file has been updated.