
Separate query and storage alignment requirements [MOD-13837]#946

Open
dor-forer wants to merge 5 commits into main from dorer-split-alignment

Conversation

Collaborator

@dor-forer dor-forer commented May 3, 2026

Describe the changes in the pull request

Split the single alignment hint carried by the preprocessor stack into two independent values, query_alignment and storage_alignment, so that storage and query buffers can satisfy different SIMD alignment requirements. This is the prerequisite plumbing for asymmetric kernels such as SQ8↔FP32, where the storage operand (e.g. SQ8 → 16 B) and the query operand (e.g. FP32 → 32 B) have different preferences.

Key changes:

  • PreprocessorInterface::preprocess now takes both storage_alignment and query_alignment. The legacy single-alignment overload at the lowest layer is removed; the same-blob path is decided by the concrete preprocessor (CosinePreprocessor shares, QuantPreprocessor does not).
  • PreprocessorsContainerAbstract / MultiPreprocessorsContainer store both fields and expose getQueryAlignment() / getStorageAlignment(). The deprecated getAlignment() alias has been removed; all callers now explicitly pick one.
  • CreatePreprocessorsContainer and CreatePreprocessorsContainerParams get an asymmetric overload that takes both alignments; the existing single-alignment overload mirrors the value into both fields, so all current production callers are unchanged behaviorally.
  • CosinePreprocessor allocates its shared buffer with spaces::combineAlignments(storage_alignment, query_alignment) (the strictest power-of-two that satisfies both consumers).
  • QuantPreprocessor::preprocessForStorage honors storage_alignment when allocating the quantized blob.
  • The asymmetric-types contract for GetDistFunc is documented in spaces.h: the returned hint refers to the first operand only; callers that need both must query both dispatchers and combine.
  • The x86 SQ8 dispatchers (IP_space.cpp, L2_space.cpp) now publish per-feature alignment hints (AVX512 → 16 / 32, AVX2_FMA → 8, AVX2 → 8, SSE4 → 4, no-opt → 0) for both the SQ8↔SQ8 and SQ8↔FP32 paths.
  • VecSimAllocator::allocate_aligned(size, 0) is documented/exercised as equivalent to allocate(size), allowing the preprocessor allocation sites to drop the alignment ? aligned : plain ternary helper and call the aligned variant unconditionally.

No in-tree production caller currently produces asymmetric alignments: CreateIndexComponents is templated on a single DataType, queries one dispatcher, and forwards the same hint to both fields. The asymmetric path is reachable today only by callers that build the components manually (or by the disk-backed quantized index in the follow-up vecsim_disk repo, where query_alignment = fullAlignment; storage_alignment = combineAlignments(quantizedAlignment, mixedAlignment)).

Test changes:

  • PreprocessorsTest.QuantizationAsymmetricAlignment (new) exercises query_alignment=32, storage_alignment=16 through preprocess, preprocessForStorage, and preprocessQuery, asserting independent pointer alignment for both blobs.
  • SpacesTest.SQ8_FP32_DispatcherAlignmentHints and SpacesTest.SQ8_SQ8_DispatcherAlignmentHints (new) directly assert the alignment hint published by the IP / L2 / Cosine dispatchers per CPU feature, including the no-optimization 0-case.
  • Existing tests that previously used alignment = 5 were bumped to alignment = 8 so they satisfy the new combineAlignments power-of-two invariant. test_allocator, test_bruteforce, and test_hnsw_tiered were migrated from getAlignment() to getStorageAlignment() (every site was already storage-semantic).
  • Full ctest -j 4 sweep: 2150 / 2150 passing.

Which issues this PR fixes

  1. MOD-13837

Main objects this PR modified

  1. PreprocessorInterface / CosinePreprocessor / QuantPreprocessor (src/VecSim/spaces/computer/preprocessors.h)
  2. PreprocessorsContainerAbstract / MultiPreprocessorsContainer (src/VecSim/spaces/computer/preprocessor_container.{h,cpp})
  3. CreatePreprocessorsContainer / CreatePreprocessorsContainerParams (src/VecSim/index_factories/components/preprocessors_factory.h)
  4. VecSimIndexAbstract accessors (src/VecSim/vec_sim_index.h)
  5. SQ8 dispatchers in src/VecSim/spaces/IP_space.cpp and src/VecSim/spaces/L2_space.cpp
  6. spaces::combineAlignments and the asymmetric-types contract documentation (src/VecSim/spaces/spaces.h)

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

Medium Risk
Touches core preprocessing/allocation paths and SIMD-dispatcher alignment hints, so mis-propagated alignment could cause performance regressions or subtle crashes on optimized kernels. Changes are mostly plumbing with added tests to guard behavior.

Overview
Adds separate alignment hints for query vs stored vectors across the preprocessor stack, replacing the single alignment value with query_alignment and storage_alignment (with backward-compatible single-alignment overloads).

Updates preprocessors/containers to propagate and honor the correct alignment per allocation path (including shared-buffer cases via spaces::combineAlignments), switches index storage allocation to use the storage alignment explicitly, and extends SQ8 distance dispatchers to publish per-ISA alignment hints for the SQ8 (first/storage) operand.

Tests are adjusted to use power-of-two alignments, callers are migrated from getAlignment() to getStorageAlignment(), and new unit tests assert asymmetric alignment behavior and exact dispatcher hint values.

Reviewed by Cursor Bugbot for commit b9f8af9. Bugbot is set up for automated code reviews on this repo. Configure here.

dor-forer added 5 commits May 3, 2026 09:53
Split the single alignment hint carried by PreprocessorsContainerAbstract
into query_alignment and storage_alignment so different SIMD kernels
(e.g. SQ8 storage vs FP32 query) can declare different alignment needs.

- spaces/spaces.h: add combineAlignments() helper and document the
  GetDistFunc asymmetric-types contract (alignment hint refers to the
  first / storage operand only).
- spaces/IP_space.cpp, spaces/L2_space.cpp: SQ8<->FP32 and SQ8<->SQ8
  dispatchers now publish per-kernel alignment hints.
- spaces/computer/preprocessors.h: PreprocessorInterface preprocess()
  and preprocessForStorage() carry storage_alignment / query_alignment.
  CosinePreprocessor uses combineAlignments() for the shared-buffer
  path. Removed the same-size preprocess() overload from the interface.
- spaces/computer/preprocessor_container.{h,cpp}: split alignment field
  into query_alignment and storage_alignment; legacy single-arg
  constructors retained as wrappers. Added getQueryAlignment() /
  getStorageAlignment(); getAlignment() kept as a deprecated alias
  returning storage_alignment.
- index_factories/components/preprocessors_factory.h: extend
  PreprocessorsContainerParams and CreatePreprocessorsContainer overloads
  to accept both alignments; single-alignment overloads kept for callers
  that don't need the distinction.
- vec_sim_index.h: DataBlocksContainer now built with
  getStorageAlignment(); added getQueryAlignment() / getStorageAlignment()
  accessors.
- tests/unit/test_components.cpp, tests/unit/test_hnsw_tiered.cpp:
  updated mock preprocessors and fixtures to the new interface.
- tests/unit/test_spaces.cpp: SelfDistanceCosine / SelfDistanceL2 for
  SQ8<->FP32 now disable AVX2_FMA / AVX2 / SSE4 before the
  no-optimization assertion, mirroring the rest of the spaces tests.
…ests

QuantPreprocessor::preprocessForStorage() now allocates the quantized
storage blob with allocate_aligned() when storage_alignment != 0, so
SQ8 storage actually honors the per-kernel alignment hint plumbed
through PreprocessorsContainerAbstract.

New tests:
- PreprocessorsTest.QuantizationAsymmetricAlignment: drives the quant
  preprocessor with query_alignment=32 and storage_alignment=16 and
  asserts that preprocess(), preprocessForStorage(), and preprocessQuery()
  each return a pointer aligned to its respective hint.
- SpacesTest.SQ8_FP32_DispatcherAlignmentHints / SQ8_SQ8_DispatcherAlignmentHints:
  assert the exact alignment-hint values published by the x86 SQ8
  dispatchers (AVX512 -> 16/32, AVX2_FMA -> 8, AVX2 -> 8, SSE4 -> 4),
  plus the no-optimization 0-case, for IP / L2 / Cosine.
VecSimAllocator::allocate_aligned(size, 0) already falls through to
allocate(size), so the helper's alignment ternary was redundant. Inline
the four call sites to a direct allocate_aligned() call.
Remove the temporary getAlignment() forwarders from
PreprocessorsContainerAbstract and VecSimIndexAbstract. All callers in
test_bruteforce, test_hnsw_tiered, and test_allocator now use the
explicit getStorageAlignment() accessor, which is what every site
already meant (DataBlock allocation overhead, brute-force storage,
homogeneous-alignment fixture).
@jit-ci

jit-ci Bot commented May 3, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@dor-forer dor-forer changed the title from "Dorer split alignment" to "Separate query and storage alignment requirements [MOD-13837]" on May 3, 2026
@dor-forer dor-forer requested review from Copilot and meiravgri May 3, 2026 13:30
Contributor

Copilot AI left a comment

Pull request overview

This PR threads separate query/storage alignment hints through the preprocessor pipeline so asymmetric kernels can request different SIMD alignments for stored vectors and queries. It updates the core preprocessor/container interfaces, publishes new SQ8 dispatcher hints, and adjusts unit tests to cover the new alignment behavior.

Changes:

  • Split the old single alignment hint into query_alignment and storage_alignment across preprocessors, containers, factories, and index accessors.
  • Updated cosine/quantization preprocessing paths and x86 SQ8 dispatchers to use the new alignment semantics.
  • Refreshed unit tests to validate asymmetric quantization alignment and dispatcher hint values.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/unit/test_spaces.cpp Adds SQ8 self-distance coverage for more x86 feature branches and new dispatcher-alignment hint tests.
tests/unit/test_hnsw_tiered.cpp Updates test preprocessor overrides to the new alignment-aware interface and accessor.
tests/unit/test_components.cpp Refactors dummy preprocessors/tests for separate query vs. storage alignment and adds asymmetric quantization coverage.
tests/unit/test_bruteforce.cpp Switches alignment sanity test to the new storage-specific accessor.
tests/unit/test_allocator.cpp Updates memory-accounting expectations to use storage alignment instead of the removed generic accessor.
src/VecSim/vec_sim_index.h Replaces the old alignment accessor with query/storage variants and uses storage alignment for vector block allocation.
src/VecSim/spaces/spaces.h Documents asymmetric GetDistFunc alignment semantics and adds combineAlignments.
src/VecSim/spaces/computer/preprocessors.h Changes the preprocessor interface to accept separate query/storage alignments and updates cosine/quantization implementations.
src/VecSim/spaces/computer/preprocessor_container.h Stores separate query/storage alignment fields and propagates them through multi-preprocessor execution.
src/VecSim/spaces/computer/preprocessor_container.cpp Makes the default aligned-copy helper use query alignment specifically.
src/VecSim/spaces/L2_space.cpp Publishes alignment hints for SQ8 L2 dispatchers on x86.
src/VecSim/spaces/IP_space.cpp Publishes alignment hints for SQ8 IP/Cosine dispatchers on x86.
src/VecSim/index_factories/components/preprocessors_factory.h Extends preprocessor factory params/overloads to carry separate query and storage alignment hints.


Comment on lines +73 to 78
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_IP_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

This matches the long-standing convention used by every existing dispatcher in IP_space.cpp / L2_space.cpp (FP32, FP64, BF16, FP16, INT8, UINT8). E.g. IP_FP32_GetDistFunc writes *alignment only inside the optimized branches and leaves it untouched on the scalar fallback. The contract — honored by the single production caller CreateIndexComponents — is that the caller initializes *alignment = 0 before invoking GetDistFunc. Flipping that to "fallback unconditionally writes 0" would have to be done across every *_GetDistFunc for consistency, which is out of scope for MOD-13837.

Comment on lines +141 to 146
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +258 to 262
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_Cosine_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +73 to 78
// Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
#ifdef OPT_AVX512_F_BW_VL_VNNI
if (features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 16 == 0) // SQ8 chunk = 16 bytes; no point in aligning if there's a residual
*alignment = 16 * sizeof(uint8_t);
return Choose_SQ8_FP32_L2_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. L2_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +3964 to +3967
// No-optimization path must leave the hint at 0.
unsigned char alignment = 0;
(void)get(dim, &alignment, &opt);
ASSERT_EQ(alignment, 0u) << kind << ": no-optimization hint should be 0";
Collaborator Author

The test asserts the documented contract: caller initializes *alignment = 0, fallback leaves it untouched. Seeding a non-zero sentinel would test a stricter contract that no *_GetDistFunc in this codebase implements — FP32, FP64, BF16, FP16, INT8, and UINT8 all leave the output untouched on the scalar fallback. Aligning the SQ8 tests to that same convention is intentional.

Comment on lines +38 to +41
static inline unsigned char combineAlignments(unsigned char a, unsigned char b) {
assert((a == 0 || (a & (a - 1)) == 0) && "alignment must be a power of two or zero");
assert((b == 0 || (b & (b - 1)) == 0) && "alignment must be a power of two or zero");
return a > b ? a : b;
Comment on lines +211 to 215
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_IP_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. IP_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +468 to 472
// AVX512 VNNI SQ8_SQ8 uses 64-element chunks; residual handling is in 32-byte sub-chunks.
if (dim >= 64 && features.avx512f && features.avx512bw && features.avx512vnni) {
if (dim % 32 == 0) // align to 256 bits when there is no offsetting residual
*alignment = 32 * sizeof(uint8_t);
return Choose_SQ8_SQ8_L2_implementation_AVX512F_BW_VL_VNNI(dim);
Collaborator Author

Same convention as the rest of the dispatcher layer (see e.g. L2_FP32_GetDistFunc in this file): *alignment is written only on optimized branches; the scalar fallback leaves it untouched. The caller (CreateIndexComponents) zero-initializes the output before calling. Diverging here would be inconsistent with FP32 / FP64 / BF16 / FP16 / INT8 / UINT8; flipping the contract globally is a separate refactor.

Comment on lines +3991 to +3994
// No-optimization path must leave the hint at 0.
unsigned char alignment = 0;
(void)get(dim, &alignment, &opt);
ASSERT_EQ(alignment, 0u) << kind << ": no-optimization hint should be 0";
Collaborator Author

Same response as the SQ8_FP32 test above: the assertion matches the documented contract (caller initializes *alignment = 0, fallback leaves it untouched), which is what every other *_GetDistFunc in the codebase implements. A non-zero sentinel would test a stricter contract that doesn't exist anywhere in the dispatcher layer.

@dor-forer dor-forer requested review from lerman25 and removed request for meiravgri May 4, 2026 09:09
@dor-forer dor-forer marked this pull request as ready for review May 4, 2026 09:10
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



blob = this->allocator->allocate(storage_bytes_count);
blob = storage_alignment
? this->allocator->allocate_aligned(storage_bytes_count, storage_alignment)
: this->allocator->allocate(storage_bytes_count);

Inconsistent ternary kept in preprocessForStorage allocation

Low Severity

QuantPreprocessor::preprocessForStorage retains the storage_alignment ? allocate_aligned(...) : allocate(...) ternary, while the PR description explicitly states that allocate_aligned(size, 0) is equivalent to allocate(size) and the ternary can be dropped. Within the same class, preprocessQuery already calls allocate_aligned unconditionally, and CosinePreprocessor::preprocessForStorage also drops the ternary. This inconsistency within the same class is confusing and contradicts the stated intent of the change.



@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.96%. Comparing base (5bcc53e) to head (b9f8af9).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #946      +/-   ##
==========================================
+ Coverage   96.71%   96.96%   +0.25%     
==========================================
  Files         129      130       +1     
  Lines        8057     7785     -272     
==========================================
- Hits         7792     7549     -243     
+ Misses        265      236      -29     
