Add per-variant CPU feature gating to bit_transpose benchmarks by joseph-isaacs · Pull Request #8227 · vortex-data/vortex

joseph-isaacs · 2026-06-02T22:57:17Z

What

Trial of explicit, per-benchmark CPU feature-set / architecture gating, applied to the bit_transpose benchmarks. Each benchmark declares inline which variants it should be measured under in CI, while a plain cargo bench ignores all gating and runs everything once on the host.

The single source of truth is the compile-time BENCH_VARIANT env var, which drives both the name prefix (so arch-neutral scalar benches don't collide in CodSpeed) and the gate (run vs. skip).

Macros (`encodings/fastlanes/benches/shared/mod.rs`)

variant!("name") — prefixes the bench name with the active variant.
variant_tag!(ident) — maps a known variant identifier to its string tag; an unknown identifier fails to compile (typo safety).
ignore_unless_variant!(...) — expands to divan's ignore boolean: skip unless BENCH_VARIANT=local (default) or the active variant is one of the listed feature sets.
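
A minimal sketch of how three such macros could fit together. This is not the code from `shared/mod.rs`: the `ACTIVE` constant, the `should_ignore` helper, and the use of `option_env!` with a `local` fallback are assumptions made so the example builds without any configuration. The real `variant!` presumably yields a compile-time string, whereas this sketch uses `format!` for simplicity.

```rust
/// Active variant, fixed at compile time. The PR sets the `local` default in
/// `.cargo/config.toml`; this sketch falls back with `option_env!` so it
/// builds even when BENCH_VARIANT is unset (assumption).
const ACTIVE: &str = match option_env!("BENCH_VARIANT") {
    Some(v) => v,
    None => "local",
};

/// Maps a known variant identifier to its string tag. An unknown identifier
/// matches no rule, so the macro call fails to compile.
macro_rules! variant_tag {
    (simulation) => { "simulation" };
    (x86_64) => { "x86_64" };
    (aarch64) => { "aarch64" };
}

/// Hypothetical helper: the gate as a plain function, so the logic can be
/// checked for any variant without rebuilding.
fn should_ignore(active: &str, tags: &[&str]) -> bool {
    active != "local" && !tags.contains(&active)
}

/// Expands to a boolean: true means "skip this benchmark".
macro_rules! ignore_unless_variant {
    ($($tag:ident),+ $(,)?) => {
        should_ignore(ACTIVE, &[$(variant_tag!($tag)),+])
    };
}

/// Prefixes a benchmark name with the active variant.
macro_rules! variant {
    ($name:literal) => { format!("{}::{}", ACTIVE, $name) };
}

fn main() {
    println!("{}", variant!("transpose_scalar"));
    println!("skip x86-only bench: {}", ignore_unless_variant!(simulation, x86_64));
    println!("skip neon bench: {}", ignore_unless_variant!(aarch64));
}
```

Writing `ignore_unless_variant!(avx2)` in this sketch is a compile error, because `variant_tag!` has no rule for `avx2`. That is the typo safety described above.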

Per-benchmark tags (`bit_transpose.rs`)

| benchmarks | tags |
| --- | --- |
| scalar (baseline) | `simulation, x86_64, aarch64` |
| bmi2 / vbmi | `simulation, x86_64` |
| neon | `aarch64` |

CI (`.github/workflows/codspeed.yml`)

The existing bench-codspeed job now builds with BENCH_VARIANT=simulation, so the simulation-tagged variants run there in simulation mode (x86_64 + avx2) — no local:: rename, no duplication.
New bench-codspeed-bittranspose job: walltime legs on real silicon, one per architecture, each building only --bench bit_transpose with its own target features + BENCH_VARIANT:
- x86_64 — amd64-medium / ubuntu24-full-x64-pre-v2, -C target-feature=+avx2
- aarch64 — arm64-medium / ubuntu24-full-arm64-pre-v2, -C target-feature=+neon

Behavior

| Context | `BENCH_VARIANT` | What runs | Names | Mode |
| --- | --- | --- | --- | --- |
| Local `cargo bench` | `local` | all benches once | `local::<fn>` | divan walltime |
| `bench-codspeed` | `simulation` | scalar + bmi2 + vbmi | `simulation::<fn>` | simulation |
| `bittranspose` x86_64 leg | `x86_64` | scalar + bmi2 + vbmi | `x86_64::<fn>` | walltime |
| `bittranspose` aarch64 leg | `aarch64` | scalar + neon | `aarch64::<fn>` | walltime |
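
The "What runs" and "Names" columns follow mechanically from the per-benchmark tags. The sketch below derives them for each variant. The `BENCHES` table and `runs_under` function are illustrative names, not part of the PR, and the benchmark names stand in for the real function names.

```rust
/// Tags copied from the per-benchmark table; names are placeholders.
const BENCHES: &[(&str, &[&str])] = &[
    ("scalar", &["simulation", "x86_64", "aarch64"]),
    ("bmi2", &["simulation", "x86_64"]),
    ("vbmi", &["simulation", "x86_64"]),
    ("neon", &["aarch64"]),
];

/// Lists the prefixed names of the benchmarks that run under `variant`.
/// `local` bypasses the gate entirely, as with a plain `cargo bench`.
fn runs_under(variant: &str) -> Vec<String> {
    BENCHES
        .iter()
        .filter(|(_, tags)| variant == "local" || tags.contains(&variant))
        .map(|(name, _)| format!("{variant}::{name}"))
        .collect()
}

fn main() {
    for variant in ["local", "simulation", "x86_64", "aarch64"] {
        println!("{variant}: {:?}", runs_under(variant));
    }
}
```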

Checks

cargo build / cargo clippy --all-features / cargo +nightly fmt --check on the bench — clean.
yamllint --strict -c .yamllint.yaml on the workflow — clean.
Runtime gating verified on an x86 host: BENCH_VARIANT=aarch64 skips bmi2/vbmi (reported as `(ignored)`) and runs only the scalar baselines; BENCH_VARIANT=x86_64 runs scalar + bmi2 (vbmi shows the pre-existing "no function registered" warning because the dev host lacks AVX512-VBMI).

Notes:

The vbmi path shares the x86_64 build rather than forcing global +avx512vbmi (which risks SIGILL in surrounding code on non-AVX512 runners); the #[target_feature] intrinsics + has_vbmi() runtime guard handle it safely.
Variant tags are arch-level (x86_64/aarch64) rather than avx2/neon, because bit_transpose's x86 paths (BMI2/VBMI) are runtime-selected within a single x86 build.
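
The runtime-selection pattern the notes rely on can be sketched as follows. This is an illustration under assumptions, not the fastlanes code: `gather_msb` is a made-up stand-in for a bit-transpose building block, it dispatches on BMI2 rather than VBMI, and `has_vbmi()` here is a guess at what a guard of that name might do.

```rust
#[cfg(target_arch = "x86_64")]
const MSB_MASK: u64 = 0x8080_8080_8080_8080;

/// Portable baseline: collect the top bit of each byte into one output byte.
fn gather_msb_scalar(x: u64) -> u8 {
    let mut out = 0u8;
    for i in 0..8 {
        out |= (((x >> (8 * i + 7)) & 1) as u8) << i;
    }
    out
}

/// BMI2 path: only this function is compiled with BMI2 enabled, so the rest
/// of the binary stays safe to run on CPUs without it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn gather_msb_bmi2(x: u64) -> u8 {
    std::arch::x86_64::_pext_u64(x, MSB_MASK) as u8
}

/// Hypothetical equivalent of the `has_vbmi()` guard mentioned in the notes.
#[cfg(target_arch = "x86_64")]
fn has_vbmi() -> bool {
    std::arch::is_x86_feature_detected!("avx512vbmi")
}

/// Picks the fastest available path at runtime within a single build.
fn gather_msb(x: u64) -> u8 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified on the line above.
            return unsafe { gather_msb_bmi2(x) };
        }
    }
    gather_msb_scalar(x)
}

fn main() {
    println!("{:#010b}", gather_msb(0x8000_0000_0000_0080));
    #[cfg(target_arch = "x86_64")]
    println!("AVX512-VBMI available: {}", has_vbmi());
}
```

Because the feature is enabled per function rather than through a global `-C target-feature` flag, the compiler cannot emit the gated instructions anywhere else. This is why a shared x86_64 build avoids the SIGILL risk described above.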

https://claude.ai/code/session_01MkzByEJLta4WN2vLqRyvZ1

Generated by Claude Code

Each bit_transpose benchmark now declares, inline, which CPU feature sets / architectures it should be measured under in CI, via a small set of macros driven by the compile-time BENCH_VARIANT environment variable:

- variant! prefixes the benchmark name with the active variant so the architecture-neutral scalar benchmarks (which run on every leg) do not collide in CodSpeed.
- variant_tag! maps a known variant identifier to its string tag; an unknown identifier fails to compile, giving typo-safe tags.
- ignore_unless_variant! expands to divan's `ignore` boolean, skipping a benchmark unless we run locally (BENCH_VARIANT=local, the default) or the active variant is one of the listed feature sets.

A plain `cargo bench` leaves BENCH_VARIANT at its `local` default (set in .cargo/config.toml) and runs every benchmark once on the host. CI sets BENCH_VARIANT per leg:

- the existing bench-codspeed job builds with BENCH_VARIANT=simulation, so the simulation-tagged scalar/bmi2/vbmi variants run there in simulation mode on x86_64+avx2;
- a new bench-codspeed-bittranspose job adds walltime legs on real silicon, one per architecture (x86_64 with +avx2, aarch64 with +neon), each building only the bit_transpose bench with its own target features.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

The bit_transpose aarch64 CodSpeed leg failed in the system-info step: `grep -m1 "model name" /proc/cpuinfo` returns no match on ARM (no such line; the model is shown by lscpu), and under GitHub's `bash -e` the failing grep aborts the otherwise-diagnostic step. ARM also exposes CPU features as "Features" rather than "flags". Make both cpuinfo greps non-fatal and match the aarch64 "Features" line so the diagnostic step never fails the build on either architecture.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
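
The actual fix lives in the workflow's shell step. Purely as an illustration of the same tolerant lookup, here is a sketch in Rust: a missing key yields `None` instead of a failure, and the feature line is matched under either name. The `field` helper and the sample cpuinfo strings are made up for this example.

```rust
/// Returns the value of the first `key: value` line whose key is in `keys`,
/// or `None` when no such line exists (the non-fatal case).
fn field<'a>(cpuinfo: &'a str, keys: &[&str]) -> Option<&'a str> {
    cpuinfo.lines().find_map(|line| {
        let (key, value) = line.split_once(':')?;
        keys.contains(&key.trim()).then(|| value.trim())
    })
}

fn main() {
    // Made-up excerpts in the shape of /proc/cpuinfo on each architecture.
    let x86 = "model name\t: Example CPU\nflags\t\t: fpu avx2 bmi2\n";
    let arm = "processor\t: 0\nFeatures\t: fp asimd\n";

    for (arch, info) in [("x86_64", x86), ("aarch64", arm)] {
        let model = field(info, &["model name"]).unwrap_or("(not reported)");
        let features = field(info, &["flags", "Features"]).unwrap_or("(not reported)");
        println!("{arch}: model = {model}, features = {features}");
    }
}
```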

joseph-isaacs added the `changelog/skip` label (Do not list PR in the changelog) on Jun 2, 2026, with Claude

joseph-isaacs wants to merge 2 commits into develop from claude/bituntranspose-bench-variants-ArFmf
