KernelPilot is now a lightweight prompt repository for standalone GPU-kernel optimization tasks. The diffusion tasks target SGLang diffusion operators, but the optimization and benchmark workspace does not patch, import, or monkey-patch SGLang at runtime.
For each diffusion task, the agent must copy the relevant upstream SGLang kernel
source from the latest upstream SGLang main commit into the task's baseline/
directory, expose that copied baseline through the same low-overhead ABI used by
the candidate, then benchmark baseline and candidate side by side inside the
task directory.
docs/
  diffusion_correctness_contract.md
  diffusion_kernel_rules.md
  ghostty_claude_code_workflow.md
  diffusion_benchmark_shape_coverage.md
  standalone_diffusion_benchmark.md
  standalone_diffusion_benchmark_template.py
external/
  KernelWiki/
  ncu-report-skill/
kernels/
  {b200,h200}_diffusion_qknorm_rope__multi_shape/
  {b200,h200}_diffusion_norm_infer__multi_shape/
  {b200,h200}_diffusion_group_norm_silu__multi_shape/
  {b200,h200}_diffusion_rotary_embedding__multi_shape/
  {b200,h200}_diffusion_fuse_scale_shift__multi_shape/
  {b200,h200}_diffusion_cutedsl_norm_tanh_mul_add__multi_shape/
  {b200,h200}_diffusion_cutedsl_norm_scale_shift__multi_shape/
scripts/
  launch_kda_kernel_task.sh
  launch_kernels/
The old SGLang overlay/export/capture machinery has intentionally been removed.
There is no kda_kernels/, no patch applied to an SGLang checkout, and no
runtime install path. The only SGLang dependency in a diffusion task is the
upstream source code copied into baseline/ as local benchmark input.
Each diffusion task starts clean:
  prompt.md     # task card for the agent
  config.toml   # benchmark/build defaults for the task
  baseline/     # copied upstream baseline source, generated by the agent
  solution/     # optimized candidate source, generated by the agent
  bench/        # standalone benchmark/correctness harness, generated by the agent
  docs/         # source notes, benchmark logs, profile notes
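Once the agent has populated a task, the layout above can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper name `missing_task_entries` is hypothetical, and only the entry names are taken from the layout listing.

```python
from pathlib import Path

# Entry names are taken from the task layout above; the helper itself is hypothetical.
EXPECTED_FILES = ("prompt.md", "config.toml")
EXPECTED_DIRS = ("baseline", "solution", "bench", "docs")


def missing_task_entries(task_dir):
    """Return the expected files and directories that are absent from task_dir."""
    task_dir = Path(task_dir)
    missing = [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

Calling `missing_task_entries("kernels/<task_dir>")` returns an empty list when every entry exists, and otherwise the names that still need to be created.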
The central benchmark rules live in
docs/standalone_diffusion_benchmark.md.
Diffusion optimization guardrails live in
docs/diffusion_kernel_rules.md.
Canonical regression grids live in
docs/diffusion_correctness_contract.md.
The production preset and shape audit lives in
docs/diffusion_benchmark_shape_coverage.md.
Every task prompt requires the agent to follow these documents.
Baseline and candidate must be compared through matching local interfaces. The preferred binding follows a local direct CUDA ABI:
language = "cuda"
entry_point = "kernel.cu::<exported_symbol>"
destination_passing_style = true
- direct TVM_FFI_DLL_EXPORT_TYPED_FUNC export
- output tensors passed as trailing arguments
- CUDA launches on at::cuda::getCurrentCUDAStream()
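The destination-passing convention in this binding can be pictured with a CPU-only stand-in. This sketch is purely illustrative and assumes nothing about the real bindings: the `scale` operation and both function names are hypothetical, and plain Python lists take the place of CUDA tensors. It shows only the calling shape, with inputs first and the caller-owned output last.

```python
def baseline_scale(x, factor, out):
    """Baseline stand-in: writes x * factor into the caller-owned out buffer."""
    for i, value in enumerate(x):
        out[i] = value * factor


def candidate_scale(x, factor, out):
    """Candidate stand-in with the identical signature (inputs first, output last)."""
    out[:] = [value * factor for value in x]  # in-place write, no new buffer returned


x = [1.0, 2.0, 3.0]
out_baseline = [0.0] * len(x)   # preallocated once by the caller
out_candidate = [0.0] * len(x)

baseline_scale(x, 2.0, out_baseline)
candidate_scale(x, 2.0, out_candidate)
```

Because neither function allocates or returns its result, a harness can reuse the same output buffers across every timed call and compare the two buffers directly.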
The benchmark must use fixed workload rows, isolated per-workload execution,
preallocated outputs, warmups, CUDA-event timing with inner-loop amplification,
interleaved A/B sampling, strict correctness checks, and full provenance.
Use
docs/standalone_diffusion_benchmark_template.py
as the starting point for every diffusion task's bench/benchmark.py.
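To make the sampling scheme concrete, here is a small CPU-only sketch of warmups, inner-loop amplification, and interleaved A/B sampling. It is not the template named above: every name in it is hypothetical, `time.perf_counter` stands in for CUDA-event timing, and exact list equality stands in for the task's correctness check.

```python
import statistics
import time


def time_amplified(fn, args, inner_iters):
    """Time inner_iters back-to-back calls; return mean seconds per call."""
    start = time.perf_counter()
    for _ in range(inner_iters):
        fn(*args)
    return (time.perf_counter() - start) / inner_iters


def interleaved_ab(baseline, candidate, inputs, out_a, out_b,
                   warmups=3, samples=10, inner_iters=20):
    """Return (median_baseline_s, median_candidate_s) for one fixed workload row."""
    args_a = (*inputs, out_a)  # outputs are preallocated and passed last
    args_b = (*inputs, out_b)
    for _ in range(warmups):
        baseline(*args_a)
        candidate(*args_b)
    if out_a != out_b:  # check correctness before any timing is reported
        raise AssertionError("candidate output differs from baseline")
    times_a, times_b = [], []
    for i in range(samples):
        # Alternate which side runs first so slow drift hits both sides equally.
        order = ("a", "b") if i % 2 == 0 else ("b", "a")
        for side in order:
            if side == "a":
                times_a.append(time_amplified(baseline, args_a, inner_iters))
            else:
                times_b.append(time_amplified(candidate, args_b, inner_iters))
    return statistics.median(times_a), statistics.median(times_b)


def ref_double(x, out):      # hypothetical baseline stand-in
    for i, v in enumerate(x):
        out[i] = 2 * v


def fast_double(x, out):     # hypothetical candidate stand-in
    out[:] = [2 * v for v in x]


row = list(range(256))       # one fixed workload row
out_a, out_b = [0] * len(row), [0] * len(row)
median_a, median_b = interleaved_ab(ref_double, fast_double, (row,), out_a, out_b)
```

Alternating the run order per sample is one simple way to interleave; a GPU harness would additionally record CUDA events on the current stream and synchronize before reading elapsed time.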
The existing launch scripts still create task-owned worktrees and bootstrap the task prompt for an agent run:
scripts/launch_kernels/k03_b200_diffusion_qknorm_rope__multi_shape.sh
Set KDA_NO_CLAUDE=1 to prepare the worktree without launching Claude.
Launchers default KDA_BASE_BRANCH to the current checkout branch, so task
worktrees inherit the branch you are testing. Set KDA_BASE_BRANCH=<ref> only
when you intentionally want to launch from another committed ref.
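The defaulting rule for KDA_BASE_BRANCH can be summarized in a few lines. This is an illustrative sketch rather than the launcher's actual shell logic: the function name is hypothetical, and treating an empty value the same as an unset one is an assumption.

```python
def resolve_base_branch(env, current_branch):
    """Return KDA_BASE_BRANCH when set and non-empty, else the current checkout branch.

    Treating an empty value as unset is an assumption, not confirmed launcher behavior.
    """
    return env.get("KDA_BASE_BRANCH") or current_branch
```

For example, `resolve_base_branch({}, "my-feature")` yields `"my-feature"`, while passing `{"KDA_BASE_BRANCH": "main"}` yields `"main"`.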
For manual parallel panes in Ghostty, use
docs/ghostty_claude_code_workflow.md.
Check shell launchers after edits:
bash -n scripts/launch_kda_kernel_task.sh scripts/launch_kernels/*.sh
External knowledge submodules remain optional supporting material:
git submodule update --init --recursive