
Separate query and storage alignment requirements [MOD-13837]#946

Open
dor-forer wants to merge 5 commits into main from dorer-split-alignment

Conversation

Collaborator

@dor-forer dor-forer commented May 3, 2026

Describe the changes in the pull request

Split the single alignment hint carried by the preprocessor stack into two independent values, query_alignment and storage_alignment, so that storage and query buffers can satisfy different SIMD alignment requirements. This is the prerequisite plumbing for asymmetric kernels such as SQ8↔FP32, where the storage operand (e.g. SQ8 → 16 B) and the query operand (e.g. FP32 → 32 B) have different preferences.

Key changes:

  • PreprocessorInterface::preprocess now takes both storage_alignment and query_alignment. The legacy single-alignment overload at the lowest layer is removed; the same-blob path is decided by the concrete preprocessor (CosinePreprocessor shares, QuantPreprocessor does not).
  • PreprocessorsContainerAbstract / MultiPreprocessorsContainer store both fields and expose getQueryAlignment() / getStorageAlignment(). The deprecated getAlignment() alias has been removed; all callers now explicitly pick one.
  • CreatePreprocessorsContainer and CreatePreprocessorsContainerParams get an asymmetric overload that takes both alignments; the existing single-alignment overload mirrors the value into both fields, so all current production callers are unchanged behaviorally.
  • CosinePreprocessor allocates its shared buffer with spaces::combineAlignments(storage_alignment, query_alignment) (the strictest power-of-two that satisfies both consumers).
  • QuantPreprocessor::preprocessForStorage honors storage_alignment when allocating the quantized blob.
  • The asymmetric-types contract for GetDistFunc is documented in spaces.h: the returned hint refers to the first operand only; callers that need both must query both dispatchers and combine.
  • The x86 SQ8 dispatchers (IP_space.cpp, L2_space.cpp) now publish per-feature alignment hints (AVX512 → 16 / 32, AVX2_FMA → 8, AVX2 → 8, SSE4 → 4, no-opt → 0) for both the SQ8↔SQ8 and SQ8↔FP32 paths.
  • VecSimAllocator::allocate_aligned(size, 0) is documented/exercised as equivalent to allocate(size), allowing the preprocessor allocation sites to drop the alignment ? aligned : plain ternary helper and call the aligned variant unconditionally.

No in-tree production caller currently produces asymmetric alignments: CreateIndexComponents is templated on a single DataType, queries one dispatcher, and forwards the same hint to both fields. The asymmetric path is reachable today only by callers that build the components manually (or by the disk-backed quantized index in the follow-up vecsim_disk repo, where query_alignment = fullAlignment; storage_alignment = combineAlignments(quantizedAlignment, mixedAlignment)).

Test changes:

  • PreprocessorsTest.QuantizationAsymmetricAlignment (new) exercises query_alignment=32, storage_alignment=16 through preprocess, preprocessForStorage, and preprocessQuery, asserting independent pointer alignment for both blobs.
  • SpacesTest.SQ8_FP32_DispatcherAlignmentHints and SpacesTest.SQ8_SQ8_DispatcherAlignmentHints (new) directly assert the alignment hint published by the IP / L2 / Cosine dispatchers per CPU feature, including the no-optimization 0-case.
  • Existing tests that previously used alignment = 5 were bumped to alignment = 8 so they satisfy the new combineAlignments power-of-two invariant. test_allocator, test_bruteforce, and test_hnsw_tiered were migrated from getAlignment() to getStorageAlignment() (every site was already storage-semantic).
  • Full ctest -j 4 sweep: 2150 / 2150 passing.

Which issues this PR fixes

  1. MOD-13837

Main objects this PR modified

  1. PreprocessorInterface / CosinePreprocessor / QuantPreprocessor (src/VecSim/spaces/computer/preprocessors.h)
  2. PreprocessorsContainerAbstract / MultiPreprocessorsContainer (src/VecSim/spaces/computer/preprocessor_container.{h,cpp})
  3. CreatePreprocessorsContainer / CreatePreprocessorsContainerParams (src/VecSim/index_factories/components/preprocessors_factory.h)
  4. VecSimIndexAbstract accessors (src/VecSim/vec_sim_index.h)
  5. SQ8 dispatchers in src/VecSim/spaces/IP_space.cpp and src/VecSim/spaces/L2_space.cpp
  6. spaces::combineAlignments and the asymmetric-types contract documentation (src/VecSim/spaces/spaces.h)

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

Medium Risk
Touches core preprocessing/allocation paths and SIMD-dispatcher alignment hints, so mis-propagated alignment could cause performance regressions or subtle crashes on optimized kernels. Changes are mostly plumbing with added tests to guard behavior.

Overview
Adds separate alignment hints for query vs stored vectors across the preprocessor stack, replacing the single alignment value with query_alignment and storage_alignment (with backward-compatible single-alignment overloads).

Updates preprocessors/containers to propagate and honor the correct alignment per allocation path (including shared-buffer cases via spaces::combineAlignments), switches index storage allocation to use the storage alignment explicitly, and extends SQ8 distance dispatchers to publish per-ISA alignment hints for the SQ8 (first/storage) operand.

Tests are adjusted to use power-of-two alignments, callers are migrated from getAlignment() to getStorageAlignment(), and new unit tests assert asymmetric alignment behavior and exact dispatcher hint values.

Reviewed by Cursor Bugbot for commit b9f8af9. Bugbot is set up for automated code reviews on this repo. Configure here.

dor-forer added 5 commits May 3, 2026 09:53
Split the single alignment hint carried by PreprocessorsContainerAbstract
into query_alignment and storage_alignment so different SIMD kernels
(e.g. SQ8 storage vs FP32 query) can declare different alignment needs.

- spaces/spaces.h: add combineAlignments() helper and document the
  GetDistFunc asymmetric-types contract (alignment hint refers to the
  first / storage operand only).
- spaces/IP_space.cpp, spaces/L2_space.cpp: SQ8<->FP32 and SQ8<->SQ8
  dispatchers now publish per-kernel alignment hints.
- spaces/computer/preprocessors.h: PreprocessorInterface preprocess()
  and preprocessForStorage() carry storage_alignment / query_alignment.
  CosinePreprocessor uses combineAlignments() for the shared-buffer
  path. Removed the same-size preprocess() overload from the interface.
- spaces/computer/preprocessor_container.{h,cpp}: split alignment field
  into query_alignment and storage_alignment; legacy single-arg
  constructors retained as wrappers. Added getQueryAlignment() /
  getStorageAlignment(); getAlignment() kept as a deprecated alias
  returning storage_alignment.
- index_factories/components/preprocessors_factory.h: extend
  PreprocessorsContainerParams and CreatePreprocessorsContainer overloads
  to accept both alignments; single-alignment overloads kept for callers
  that don't need the distinction.
- vec_sim_index.h: DataBlocksContainer now built with
  getStorageAlignment(); added getQueryAlignment() / getStorageAlignment()
  accessors.
- tests/unit/test_components.cpp, tests/unit/test_hnsw_tiered.cpp:
  updated mock preprocessors and fixtures to the new interface.
- tests/unit/test_spaces.cpp: SelfDistanceCosine / SelfDistanceL2 for
  SQ8<->FP32 now disable AVX2_FMA / AVX2 / SSE4 before the
  no-optimization assertion, mirroring the rest of the spaces tests.
…ests

QuantPreprocessor::preprocessForStorage() now allocates the quantized
storage blob with allocate_aligned() when storage_alignment != 0, so
SQ8 storage actually honors the per-kernel alignment hint plumbed
through PreprocessorsContainerAbstract.

New tests:
- PreprocessorsTest.QuantizationAsymmetricAlignment: drives the quant
  preprocessor with query_alignment=32 and storage_alignment=16 and
  asserts that preprocess(), preprocessForStorage(), and preprocessQuery()
  each return a pointer aligned to its respective hint.
- SpacesTest.SQ8_FP32_DispatcherAlignmentHints / SQ8_SQ8_DispatcherAlignmentHints:
  assert the exact alignment-hint values published by the x86 SQ8
  dispatchers (AVX512 -> 16/32, AVX2_FMA -> 8, AVX2 -> 8, SSE4 -> 4),
  plus the no-optimization 0-case, for IP / L2 / Cosine.
VecSimAllocator::allocate_aligned(size, 0) already falls through to
allocate(size), so the helper's alignment ternary was redundant. Inline
the four call sites to a direct allocate_aligned() call.
Remove the temporary getAlignment() forwarders from
PreprocessorsContainerAbstract and VecSimIndexAbstract. All callers in
test_bruteforce, test_hnsw_tiered, and test_allocator now use the
explicit getStorageAlignment() accessor, which is what every site
already meant (DataBlock allocation overhead, brute-force storage,
homogeneous-alignment fixture).
@jit-ci

jit-ci Bot commented May 3, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@dor-forer dor-forer changed the title from "Dorer split alignment" to "Separate query and storage alignment requirements [MOD-13837]" on May 3, 2026
@dor-forer dor-forer requested review from Copilot and meiravgri May 3, 2026 13:30
Contributor

Copilot AI left a comment

Pull request overview

This PR threads separate query/storage alignment hints through the preprocessor pipeline so asymmetric kernels can request different SIMD alignments for stored vectors and queries. It updates the core preprocessor/container interfaces, publishes new SQ8 dispatcher hints, and adjusts unit tests to cover the new alignment behavior.

Changes:

  • Split the old single alignment hint into query_alignment and storage_alignment across preprocessors, containers, factories, and index accessors.
  • Updated cosine/quantization preprocessing paths and x86 SQ8 dispatchers to use the new alignment semantics.
  • Refreshed unit tests to validate asymmetric quantization alignment and dispatcher hint values.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/unit/test_spaces.cpp Adds SQ8 self-distance coverage for more x86 feature branches and new dispatcher-alignment hint tests.
tests/unit/test_hnsw_tiered.cpp Updates test preprocessor overrides to the new alignment-aware interface and accessor.
tests/unit/test_components.cpp Refactors dummy preprocessors/tests for separate query vs. storage alignment and adds asymmetric quantization coverage.
tests/unit/test_bruteforce.cpp Switches alignment sanity test to the new storage-specific accessor.
tests/unit/test_allocator.cpp Updates memory-accounting expectations to use storage alignment instead of the removed generic accessor.
src/VecSim/vec_sim_index.h Replaces the old alignment accessor with query/storage variants and uses storage alignment for vector block allocation.
src/VecSim/spaces/spaces.h Documents asymmetric GetDistFunc alignment semantics and adds combineAlignments.
src/VecSim/spaces/computer/preprocessors.h Changes the preprocessor interface to accept separate query/storage alignments and updates cosine/quantization implementations.
src/VecSim/spaces/computer/preprocessor_container.h Stores separate query/storage alignment fields and propagates them through multi-preprocessor execution.
src/VecSim/spaces/computer/preprocessor_container.cpp Makes the default aligned-copy helper use query alignment specifically.
src/VecSim/spaces/L2_space.cpp Publishes alignment hints for SQ8 L2 dispatchers on x86.
src/VecSim/spaces/IP_space.cpp Publishes alignment hints for SQ8 IP/Cosine dispatchers on x86.
src/VecSim/index_factories/components/preprocessors_factory.h Extends preprocessor factory params/overloads to carry separate query and storage alignment hints.


Comment on lines +73 to 78
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

This matches the long-standing convention used by every existing dispatcher in IP_space.cpp / L2_space.cpp (FP32, FP64, BF16, FP16, INT8, UINT8). E.g. IP_FP32_GetDistFunc writes *alignment only inside the optimized branches and leaves it untouched on the scalar fallback. The contract — honored by the single production caller CreateIndexComponents — is that the caller initializes *alignment = 0 before invoking GetDistFunc. Flipping that to "fallback unconditionally writes 0" would have to be done across every *_GetDistFunc for consistency, which is out of scope for MOD-13837.

Comment on lines +141 to 146
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +258 to 262
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +73 to 78
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes; no point in aligning if there's a residual
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. L2_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +3964 to +3967
// No-optimization path must leave the hint at 0.
unsigned char alignment = 0;
(void)get(dim, &alignment, &opt);
ASSERT_EQ(alignment, 0u) << kind << ": no-optimization hint should be 0";
Collaborator Author

The test asserts the documented contract: caller initializes *alignment = 0, fallback leaves it untouched. Seeding a non-zero sentinel would test a stricter contract that no *_GetDistFunc in this codebase implements — FP32, FP64, BF16, FP16, INT8, and UINT8 all leave the output untouched on the scalar fallback. Aligning the SQ8 tests to that same convention is intentional.

Comment on lines +38 to +41
static inline unsigned char combineAlignments(unsigned char a, unsigned char b) {
assert((a == 0 || (a & (a - 1)) == 0) && "alignment must be a power of two or zero");
assert((b == 0 || (b & (b - 1)) == 0) && "alignment must be a power of two or zero");
return a > b ? a : b;
Comment on lines +211 to 215
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +468 to 472
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_L2_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. L2_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +3991 to +3994
// No-optimization path must leave the hint at 0.
unsigned char alignment = 0;
(void)get(dim, &alignment, &opt);
ASSERT_EQ(alignment, 0u) << kind << ": no-optimization hint should be 0";
Collaborator Author

Same response as the SQ8_FP32 test above: the assertion matches the documented contract (caller initializes *alignment = 0, fallback leaves it untouched), which is what every other *_GetDistFunc in the codebase implements. A non-zero sentinel would test a stricter contract that doesn't exist anywhere in the dispatcher layer.

@dor-forer dor-forer requested review from lerman25 and removed request for meiravgri May 4, 2026 09:09
@dor-forer dor-forer marked this pull request as ready for review May 4, 2026 09:10
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



blob = this->allocator->allocate(storage_bytes_count);
blob = storage_alignment
? this->allocator->allocate_aligned(storage_bytes_count, storage_alignment)
: this->allocator->allocate(storage_bytes_count);

Inconsistent ternary kept in preprocessForStorage allocation

Low Severity

QuantPreprocessor::preprocessForStorage retains the storage_alignment ? allocate_aligned(...) : allocate(...) ternary, while the PR description explicitly states that allocate_aligned(size, 0) is equivalent to allocate(size) and the ternary can be dropped. Within the same class, preprocessQuery already calls allocate_aligned unconditionally, and CosinePreprocessor::preprocessForStorage also drops the ternary. This inconsistency within the same class is confusing and contradicts the stated intent of the change.



@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.96%. Comparing base (5bcc53e) to head (b9f8af9).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #946      +/-   ##
==========================================
+ Coverage   96.71%   96.96%   +0.25%     
==========================================
  Files         129      130       +1     
  Lines        8057     7785     -272     
==========================================
- Hits         7792     7549     -243     
+ Misses        265      236      -29     
