feat(ascend): add 9 Ascend operator kernels #47

Draft

zhangyue207 wants to merge 56 commits into master from feat/ascend-operators
Conversation


@zhangyue207 zhangyue207 commented Apr 8, 2026

Summary

Full Ascend operator set (18 operators + framework scaffolding), plus three
foundational bug fixes surfaced during a rebase onto the latest master, and
four rounds of API alignment with vLLM's conventions.

Compared to the original PR #47, this branch adds three things on top:

  1. Rebase onto master 2ffbeb0 (which now includes the #60 WorkspacePool
    rename, the #62 caster fix, the #63 SFINAE autodetect, and so on)
  2. Three independent bug fixes exposed by the rebase (framework / CI /
    bindings generator)
  3. Skip coverage + vLLM API alignment — 18 fewer skipifs (of the
    original 1682) and six new test cases

Full suite: 3767 passed / 1664 skipped / 0 failed on Ascend 910B + CANN 8.5.1.


Ascend operators (18)

| Operator | Primary impl | Notes |
| --- | --- | --- |
| Add | `aclnnAdd` | |
| Mul | `aclnnMul` | |
| Cast | `aclnnCast` | |
| Cat | `aclnnCat` | |
| Matmul | `aclnnMatmul` | |
| Gemm | `aclnnMm` | |
| Linear | `aclnnMatmul` | + optional bias |
| RmsNorm | `aclnnRmsNorm` | + custom AscendC |
| AddRmsNorm | 3 impls | decomposed / fused aclnn / custom |
| Swiglu | `aclnnSilu` + `aclnnMul` | |
| SiluAndMul | custom AscendC | |
| CausalSoftmax | `aclnnSoftmax` | + mask |
| RotaryEmbedding | 3 impls | V2 / ATB `RopeParam` / `aclnnRopeWithSinCosCache`, filled out in this PR |
| ApplyRotaryPosEmb | V2 | takes pre-gathered cos/sin |
| ReshapeAndCache | 3 impls | aclnn `InplaceIndexCopy` / custom / ATB |
| FlashAttention | `aclnnFusedInferAttentionScoreV4` | prefill + paged decode |
| PagedAttention | ATB `PagedAttentionParam` | with CPU-pinned D2H-free entry |
| TopkToppSampling | ATB `TopkToppSamplingParam` | |

Framework / generator / CI fixes

fix(scripts): py::arg order in bindings generator

Root cause: the pybind11 bindings emitted by scripts/generate_wrappers.py
listed py::arg entries in a different order than the C++ lambda parameters.
When callers used kwargs, implementation_index and stream were silently
swapped: the stream integer went into the impl-index slot, and dispatch
aborted with SIGABRT.

Fix: emit py::arg("implementation_index") before py::arg("stream") so
the kwarg names line up positionally with the C++ signature.
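A minimal sketch of the ordering rule the generator must obey (function and parameter names here are illustrative, not the actual generator internals): `py::arg` entries have to be emitted in the same order as the C++ lambda parameters, otherwise kwargs bind to the wrong positional slots.

```python
# Hypothetical reduction of the fix in scripts/generate_wrappers.py:
# emit py::arg entries in C++ declaration order.

def emit_py_args(cpp_params):
    """cpp_params: C++ lambda parameter names, in declaration order."""
    return ", ".join(f'py::arg("{name}")' for name in cpp_params)

# Before the fix the generator emitted stream ahead of
# implementation_index; the corrected emission preserves the order:
emit_py_args(["query", "implementation_index", "stream"])
# 'py::arg("query"), py::arg("implementation_index"), py::arg("stream")'
```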

fix(ci): treat exit 137 as success when pytest junit XML reports no failures

Symptom: the Docker 18.09 chown step occasionally receives a SIGKILL,
so .ci/run.py exits with code 137 even though pytest itself completed
cleanly.

Fix: read /workspace/results/test-results.xml errors / failures
fields to determine true failure — don't treat a teardown race as a test
failure.

fix(ascend): adopt PR #63/#60 master API

Bring Ascend code up to date with the latest master conventions:

  • `WorkspacePool::GetInstance()` → `GetWorkspacePool()`
  • `Pool::Get(stream, size)` → `Pool::Ensure(stream, size)`
  • Remove per-op registry.h (use the `ActiveImplementationsImpl` SFINAE
    autodetect instead)

Skip coverage

| Area | Prior skips | Now passing |
| --- | --- | --- |
| PagedAttention: drop stale 910B skip (fixed in CANN 8.5.1) | 10 | +10 |
| RotaryEmbedding impl=1 supports is_neox_style=false (ATB `RopeParam` rotaryCoeff=head_size interleave mode) | 4 | +4 |
| RotaryEmbedding impl=2 supports partial rotary (rotary_dim < head_size) via `aclnnRopeWithSinCosCache` | 4 | +4 |

Only remaining skip: RotaryEmbedding impl=0 (V2) with
is_neox_style=false — V2 only plumbs rotaryMode="half", out of scope for
this PR.


vLLM API alignment (additive, no breaking change)

perf(reshape_and_cache): int64 slot_mapping async Cast

ATB ReshapeAndCacheParam requires int32 slot_mapping. The previous
implementation handled int64 (PyTorch / vLLM's native dtype) via D2H + CPU
cast + H2D + aclrtSynchronizeStream, which stalled the stream and made
the int64 path NPUGraph-incapturable.

Replaced with a cached aclnnCast async conversion on-stream. Performance
matches the int32 pass-through, and the whole op is now graph-capturable.

feat(rotary_embedding): optional query_out / key_out

vLLM's RotaryEmbedding.forward(positions, query, key) is inplace;
InfiniOps previously required the caller to pass query_out / key_out.

Both parameters are now std::optional<Tensor>:

  • Omitted → kernel writes back to query / key (vLLM semantics)
  • Supplied → previous out-of-place semantics

All three impls (V2, ATB, SinCosCache) support this.
test_rotary_embedding_inplace covers both dtypes × two impls.
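A minimal Python model of the resolution (names are illustrative; the real code is the C++ `value_or(query)` described above): when the outputs are omitted, the kernel writes back in place and the D2D copy can be skipped.

```python
# Toy model of the optional-output resolution in the rotary embedding op.

def resolve_outputs(query, key, query_out=None, key_out=None):
    q_out = query_out if query_out is not None else query  # ~ value_or(query)
    k_out = key_out if key_out is not None else key
    # In-place case: output aliases input, so the D2D memcpy is skipped.
    inplace = q_out is query and k_out is key
    return q_out, k_out, inplace
```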

feat(flash_attention): add sliding_window entry (additive)

Native window_left / window_right pair kept as-is; added an optional
std::optional<int64_t> sliding_window:

  • Pair only → unchanged behavior
  • sliding_window only → normalized to (sliding_window - 1, 0) causal
    sliding (vLLM convention)
  • Both → asserted consistent
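The normalization above can be sketched as (function and parameter names are assumptions, not the actual base-class API):

```python
# Sketch of the base-class window resolution: sliding_window maps to the
# causal pair (sliding_window - 1, 0); if both forms are supplied, the
# normalized values must agree.

def resolve_window(window_left=None, window_right=None, sliding_window=None):
    if sliding_window is not None:
        normalized = (sliding_window - 1, 0)
        if window_left is not None or window_right is not None:
            assert (window_left, window_right) == normalized, \
                "sliding_window disagrees with window_left/window_right"
        return normalized
    return (window_left, window_right)
```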

generate_wrappers.py extended: all std::optional<...> parameters now
default to = py::none() (previously only std::optional<Tensor> had that
default).
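The generator change reduces to a rule like this (a hypothetical reduction; the real emission logic in generate_wrappers.py is more involved):

```python
import re

# Any std::optional<...> parameter, not just std::optional<Tensor>,
# now gets a py::none() default in the emitted py::arg.

def py_arg_for(cpp_type: str, name: str) -> str:
    if re.match(r"std::optional<.*>", cpp_type.strip()):
        return f'py::arg("{name}") = py::none()'
    return f'py::arg("{name}")'
```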

test_flash_attention_sliding_window_equivalence asserts bit-exact
equivalence between the two entry points.

docs(paged_attention): host tensor contract

The src/base/paged_attention.h class comment now explains why
seq_lens_host / block_table_host exist (CANN qSeqLens CPU-resident
contract + ATB hostData + NPUGraph capture prerequisite), so future
backend implementors understand the API contract.


Test status

=========================  ascend 910B / CANN 8.5.1 =========================
3767 passed, 1664 skipped, 0 failed, 0 error

Key delta:

  • Total skips: 1682 → 1664 (−18 = 10 PA + 4 non-neox + 4 partial)
  • New tests: Rope inplace × 4, FA sliding_window × 2 (+6)

Non-Ascend platforms: not exercised locally. This PR touches only
Ascend / base / scripts / ci; it does not alter the CUDA / Metax /
Cambricon / Moore / Iluvatar operator paths. CI covers those.


Review guidance

  1. Can be cherry-picked independently: the 3 framework / CI fixes
    (pybind arg order / CI exit 137 / #63 API adoption) — they are
    operator-agnostic and broadly useful; splitting them out into small
    PRs could accelerate master convergence.
  2. Main body: Ascend operators + API alignment — large diff;
    recommend reviewing per operator (each kernel file is self-contained).
  3. feat/ascend-operators-bak-2026-04-18 is a pre-force-push backup ref
    (points at the original PR #47 tip 0d93135). Safe to delete once
    merged.

Test plan

  • python3 .ci/run.py --local (full regression, Ascend 910B): 3767 passed
  • test_rotary_embedding_inplace (fp16/bf16 × impl=0/1): 4 passed
  • test_flash_attention_sliding_window_equivalence (pair vs sliding_window bit-exact): 2 passed
  • test_reshape_and_cache (int32 + int64 paths): 32 passed
  • test_paged_attention (10 passed after 910B skip removal)
  • clang-format passes locally on all tracked *.h / *.cc / *.cuh / *.mlu
  • CUDA / Metax / Cambricon / Moore / Iluvatar regressions (CI-verified)

@zhangyue207 zhangyue207 force-pushed the feat/ascend-framework branch 2 times, most recently from bf9e4b1 to 7398f9f on April 13, 2026 13:41
Base automatically changed from feat/ascend-framework to master on April 14, 2026 03:55
@zhangyue207 zhangyue207 force-pushed the feat/ascend-operators branch from 3f43d57 to be48553 on April 15, 2026 07:06
@voltjia voltjia requested review from Ziminli on April 17, 2026 06:12
@zhangyue207 zhangyue207 requested a review from voltjia on April 17, 2026 06:20
zhangyue added 23 commits April 18, 2026 00:01
- Add AclTensorCache for descriptor reuse across operator calls
- Rename ToAclDtype/IsIntegerDtype to toAclDtype/isIntegerDtype (camelCase)
- Extend WorkspacePool with multi-slot support and capture-mode assertion
- Optimize Gemm kernel with executor/scalar caching
- Add CacheKey hash support for operator instance caching
- Fix generate_wrappers.py argument ordering and format
- Rename skip_unsupported_dtypes fixture, add get_npu_stream utility
Add base classes: Cast, Cat, Linear, Matmul (replaces MatMul), Mul,
PagedAttention, SiluAndMul.

Rename AddRmsNorm params to match CANN convention (x1/x2/gamma/y_out/x_out).
Remove verbose doc comments from FlashAttention, ReshapeAndCache,
RotaryEmbedding base classes (implementation details belong in kernels).
Add ACLNN-based implementations for: Add, Cast, Cat, CausalSoftmax,
FlashAttention, Linear, Matmul, Mul, RmsNorm, RotaryEmbedding,
ReshapeAndCache (+ v2), Swiglu, SiluAndMul.

All kernels use AclTensorCache for descriptor reuse and
WorkspacePool for device memory management. Executor instances
are cached with aclSetAclOpExecutorRepeatable for repeat dispatch.
Add alternative implementations with registries:
- AddRmsNorm: decomposed (0), fused aclnnAddRmsNorm (1), custom AscendC (2)
- RmsNorm: ACLNN (0), custom AscendC (1)
- RotaryEmbedding: ACLNN (0), ATB Rope (1)
- ReshapeAndCache: ACLNN (0), ScatterPaKvCache (1), ATB (2)
- Swiglu: decomposed (0), fused aclnnSwiGlu (1)
- SiluAndMul: fused aclnnSwiGlu (0), registry (1)
- PagedAttention: ATB (0)
Standalone AscendC kernel project with CMake build system.
Includes op_host tiling, op_kernel device code, precision tests,
and msprof benchmarks for both operators.
Add new tests: Cast, Cat, E2E Layer, FlashAttention, Linear, Matmul,
Mul, PagedAttention, ReshapeAndCache, RotaryEmbedding, SiluAndMul.
Update existing tests with NPU stream handling and Ascend-specific
parametrization.
- C1: auto-format all C++ files with clang-format (25 files)
- C4: lowercase assert messages, remove trailing periods (10 messages)
- G4: backtick-fence identifiers in comments (causal_softmax)
- P5: add blank lines before return statements (generate_wrappers.py)
- C4: lowercase assert message starts (workspace_pool_, rms_norm, rotary_embedding)
- C4: remove trailing period from workspace_pool_ assert
- C9: add blank line between SlotKey struct members
- G4: backtick-fence identifiers in comments across 12 files
- G4: backtick-fence identifiers in assert messages (flash_attention, rotary_embedding)
- P1: remove duplicate `import re` in generate_wrappers.py
- P4: add blank lines around control flow in test_flash_attention.py
- C4: lowercase "rope" in ATB assert messages
- G4: backtick-fence `VariantPack`, `rotaryCoeff`, `sparseMode`, `hostData`
- G4: backtick-fence identifiers in Python test comments
- P4: add blank line before `if` in test_rms_norm_precision.py
… loading

- Delete `test_rms_norm_precision.py` (duplicate of `tests/test_rms_norm.py`)
- Delete `run_rms_norm_precision_report.py` (another copy with hardcoded path)
- Unify `test_add_rms_norm.py` to use `import ascend_kernel` instead of
  ctypes manual loading
New operators and features:
- ApplyRotaryPosEmb: pre-gathered cos/sin operator with ATB backend
- TopkToppSampling: ATB-based fused sampling operator
- SiluAndMul: standalone operator backed by aclnnSwiGlu
- ATB PagedAttention: graph-safe decode attention

Enhancements:
- WorkspacePool: multi-slot support and capture-mode assertion
- Migrate temp buffers to WorkspacePool slots (Swiglu, CausalSoftmax,
  RmsNorm, AddRmsNorm)
- RotaryEmbedding: accept 2D [T, N*D] input, fix ATB cos/sin gathering
- ReshapeAndCache: handle int64 slot_mapping in ATB kernel
- Swiglu: add fused aclnnSwiGlu implementation (index=1)
- Parametrize rms_norm and reshape_and_cache tests by implementation_index
… data

The operator cache keys ignore data pointers (compare only shape/dtype/
device/strides).  When RotaryEmbedding was cached from one test and
reused by another with a different cos_sin_cache tensor (same shape,
different random data), the IndexSelect gathered from the old tables,
producing garbage output.

Track the cos_sin_cache data pointer and re-upload the expanded cos/sin
tables when it changes.  In production this is a single pointer
comparison per call (no-op); the cos_sin_cache weight tensor has a
stable address.

Fixes 6 rotary_embedding_2d test failures (head_size=64, fp16, both
CANN and ATB paths) that only reproduced when test_apply_rotary_pos_emb
ran first.
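The fix above can be modeled in a few lines (class and method names are illustrative, not the actual C++ operator code):

```python
# Toy model of the staleness fix: the operator cache key ignores data
# pointers, so the op also remembers the cos_sin_cache pointer and
# re-uploads its expanded tables whenever that pointer changes.

class RotaryEmbeddingOp:
    def __init__(self):
        self._cached_ptr = None
        self._tables = None

    def ensure_tables(self, cos_sin_cache_ptr, upload):
        # In production this is a single pointer comparison per call;
        # the weight tensor has a stable address, so upload() is a no-op
        # after the first invocation.
        if cos_sin_cache_ptr != self._cached_ptr:
            self._tables = upload()  # re-expand + re-upload cos/sin tables
            self._cached_ptr = cos_sin_cache_ptr
        return self._tables
```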
Replace per-operator stale-cache workaround with Operator::clear_cache()
generation counter.  pytest autouse fixture clears caches between test
modules.  Skip aclnnScatterPaKvCache (impl_index=1) on 910B hardware.

Synced from feat/ascend-operators commits c68633f, 57f96bf.
ATB Rope with rotaryCoeff=2 supports bf16 on 910B. Remove the
fp16-only skip guard — all 6 previously skipped bf16 test cases pass.
Extend PagedAttention base class and ATB kernel with optional
seq_lens_host / block_table_host params that skip aclrtMemcpy
D2H copies when caller provides CPU-pinned host tensors.

Add unit tests for host-tensor PA and FA paged decode with CPU
cu_seqlens_kv.
`aclDestroyAclOpExecutor` internally frees `aclTensor` descriptors it holds.
Add `AclTensorCache::release()` and `destroy()` methods, guard all destructors
with `isAclRuntimeAlive()`, and remove redundant `aclDestroyTensor` calls for
executor-owned tensors. Verified: CANN reference-counts tensors, so
destroy-tensor-then-destroy-executor order is safe.
zhangyue added 26 commits April 18, 2026 00:03
…CT_BACKENDS=OFF

The docs/perf/*.md files documented the e2e optimization mission and are
not part of the Ascend operator kernel scope this PR delivers. Removed
from the tip; unchanged in intermediate history.

AUTO_DETECT_BACKENDS default flipped to OFF in pyproject.toml to avoid the
openblas link failure in the ascend CI container (master's torch-backend
auto-detect requires libgfortran symbols not present there). Build still
enables torch backend explicitly when requested.
Container-side openblas linker issue will be fixed separately; do not
regress the master-level default in this PR.
torch wheels on aarch64 (including `torch==2.9.0+cpu` used in the ascend
CI container) are auditwheel-repaired and bundle transitive dependencies
(`libgfortran-<hash>.so`, `libopenblasp-<hash>.so`) into a sibling
`torch.libs/` directory.  `torch.utils.cpp_extension.library_paths()`
returns only `torch/lib`, so the linker cannot resolve the bundled NEEDED
entries and fails with `undefined reference to _gfortran_etime@GFORTRAN_8`.

Add `torch.libs/` to both the build and install rpath, plus `-rpath-link`
for link-time resolution without polluting our final NEEDED list.
…name + drop registry.h (SFINAE autodetect)
…ailures

Docker 18.09 on Ascend CI hosts races on `--rm` cleanup: the inner
process exits cleanly with rc=0 but the daemon SIGKILLs the container
during teardown, surfacing exit code 137 to `run.py` even though the
pytest stage succeeded. Parse the per-run junit XML when returncode==137
and downgrade to a warning if no failures/errors are reported.
The skip was based on an outdated diagnosis that ATB PagedAttention
crashes during Setup on 910B + CANN 8.5.x. After the framework rebase
onto master (which includes the pybind11 kw arg order fix), all 10
parametrizations pass on 910B4 with CANN 8.5.1. Keep the NPU-available
and implementation-registered checks since they are cheap, structural
prerequisites.
RotaryEmbedding impl=1 (ATB Rope) now plumbs both rotary styles:
- is_neox_style=true  -> rotaryCoeff=2          (half split + cat)
- is_neox_style=false -> rotaryCoeff=head_size  (interleave)

The cos/sin expand path also branches: neox layout duplicates the
half values front/back, while interleave layout repeats each value
pair-wise.  Test skip is narrowed to impl=0 only, which still uses
aclnnApplyRotaryPosEmbV2 (declares "interleave" but only implements
"half").

G (partial rotary) skip message updated to reflect that neither
aclnn nor ATB fused APIs support rotary_dim < head_size.
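The two expansion layouts described above can be sketched in pure Python (lists stand in for device tensors; function names are illustrative):

```python
# neox layout duplicates the half values front/back; interleave layout
# repeats each value pair-wise.

def expand_neox(half):
    # [c0, c1, ..., c_{d/2-1}] -> [c0, ..., c_{d/2-1}, c0, ..., c_{d/2-1}]
    return half + half

def expand_interleave(half):
    # [c0, c1, ..., c_{d/2-1}] -> [c0, c0, c1, c1, ...]
    return [c for c in half for _ in range(2)]
```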
Partial rotary (`rotary_dim < head_size`) is not expressible in the V2
(`aclnnApplyRotaryPosEmbV2`, impl=0) or ATB `RopeParam` (impl=1) APIs —
both require `cos.D == sin.D == x.D`. `aclnnRopeWithSinCosCache` is the
only Ascend fused API that accepts partial rotary natively; it also
supports both neox and interleave styles via `isNeoxStyle` bool.

`test_rotary_embedding_partial` now routes through impl=2, resolving the
4 G-case skips.
…t` exist

The rationale (CANN CPU-tensor contract + NPUGraph capturability) was
only documented in the Ascend ATB kernel header.  Surface it on the base
class where the API contract lives, so any future backend implementor
understands why the optional host tensors are part of the signature.
…sync `aclnnCast`

The ATB `ReshapeAndCacheParam` (impl=2) int64 path previously did
`aclrtMemcpyAsync` D2H + CPU int64→int32 cast + `aclrtMemcpyAsync` H2D
with an explicit `aclrtSynchronizeStream` in between.  The sync blocks
the stream and makes the int64 path NPUGraph-incompatible, which forced
callers (vllm-infini) to pre-cast `slot_mapping` to int32 on the Python
side (36 redundant Cast launches otherwise per decoding step).

Route the int64 branch through a cached `aclnnCast` instead: src/dst
tensor descriptors live in `AclTensorCache` slots, the executor is set
repeatable, and the cast stays fully async on-stream.  The whole op now
matches vLLM's native int64 `slot_mapping` convention without the sync
penalty.
…e-default)

Align with vLLM's `RotaryEmbedding.forward(positions, query, key)`
signature by letting callers omit the output buffers — the kernel then
writes back in place on `query` / `key`.  This removes a signature
mismatch that forced vllm-infini to allocate and pass explicit out
tensors it doesn't need.

Base class signature:
  `query_out` / `key_out` → `std::optional<Tensor>` with `std::nullopt`
  default.  Shape / stride members fall back to `query` / `key` when the
  optional is empty.

All three Ascend impls resolve the optional to a concrete `Tensor` at
the top of `operator()` via `value_or(query)`:
  - impl=0 (aclnn V2):      skips the D2D memcpy in the inplace case
                            since `query.data() == q_out.data()`
  - impl=1 (ATB RopeParam): same short-circuit on the D2D copy
  - impl=2 (aclnnRopeWithSinCosCache): descriptors reuse `q_out` /
                            `k_out` pointers, so the kernel writes to
                            whichever tensor is resolved

Adds `test_rotary_embedding_inplace` covering both fp16 / bf16 on
impl=0 and impl=1.  Tolerance is atol=5e-3 — matches the V2 ~4 ULP
fp16 accumulator error documented in `kernel.h`.
Keeps the native `window_left` / `window_right` pair as-is and adds an
optional `std::optional<int64_t> sliding_window` parameter.  When set,
the base class normalizes it to the causal-sliding pair
`(sliding_window - 1, 0)`; when both forms are supplied the normalized
values must agree.  Callers can now use either entry point:

  // Pair form (existing, unchanged):
  flash_attention(..., window_left=255, window_right=0, ...)

  // vLLM form:
  flash_attention(..., sliding_window=256, ...)

Ascend impl reads the resolved pair from the base-class members
(`window_left_` / `window_right_`) so `sliding_window` is honored at
both construction and call time.

Also extends `generate_wrappers.py` to set `py::arg(...) = py::none()`
defaults for all `std::optional<...>` parameters (previously only
`std::optional<Tensor>`), so `sliding_window` is properly optional on
the Python side.

Adds `test_flash_attention_sliding_window_equivalence` asserting
bit-exact equality between the two entry points.
@zhangyue207 zhangyue207 force-pushed the feat/ascend-operators branch from 0d93135 to df07f95 on April 17, 2026 20:34
@zhangyue207 zhangyue207 marked this pull request as draft April 18, 2026 05:41