perf: pre-transpose PQ codebook for SIMD-friendly L2 distance #5923
Open
wkalt wants to merge 4 commits into lance-format:main
Conversation
During PQ quantization we compute L2 distances from each sub-vector
to every centroid in a codebook. The codebook is stored in AoS
(Array of Structs) layout:
```
AoS codebook [num_centroids][dimension]:
    centroid 0: [d0 d1 d2 d3 ...]
    centroid 1: [d0 d1 d2 d3 ...]
    centroid 2: [d0 d1 d2 d3 ...]
    ...
```

For each centroid:

```
diff = query - centroid    (vector subtract)
dist = sum(diff * diff)    (horizontal reduction)
```
The horizontal reduction serializes the SIMD pipeline: each
centroid must be fully reduced before starting the next.
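The per-centroid pattern above can be sketched in scalar Rust (a minimal illustrative sketch, not the actual lance kernel; the function name is hypothetical):

```rust
/// Baseline AoS distance kernel: for each centroid, subtract element-wise,
/// square, then horizontally reduce to one scalar before moving on.
/// `codebook` is [num_centroids][dim] flattened row-major.
fn l2_aos(query: &[f32], codebook: &[f32], dim: usize) -> Vec<f32> {
    codebook
        .chunks_exact(dim) // one centroid per chunk
        .map(|centroid| {
            query
                .iter()
                .zip(centroid)
                .map(|(q, c)| (q - c) * (q - c))
                .sum::<f32>() // horizontal reduction: serializes the pipeline
        })
        .collect()
}
```

The `sum` at the end is the bottleneck the patch targets: each centroid's result must be fully reduced before the next one starts.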
This patch introduces L2Prepared, which transposes the codebook once
at construction time to SoA (Structure of Arrays) layout:
```
SoA codebook [dimension][num_centroids]:
    dim 0: [c0 c1 c2 c3 ...]   <- contiguous across centroids
    dim 1: [c0 c1 c2 c3 ...]
    dim 2: [c0 c1 c2 c3 ...]
    ...
```

For each dimension d:

```
diff = query[d] - row[d]    (broadcast scalar - packed vector)
result += diff * diff       (packed FMA into running totals)
```
All centroids accumulate in parallel with no horizontal reduction
until the final result is read. LLVM emits vbroadcastss + vsubps +
vfmadd231ps (AVX2) or the equivalent on other targets.
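The SoA loop can be sketched as follows (again an illustrative scalar sketch, not the `L2Prepared` implementation; the autovectorizer turns the inner loop into broadcast + packed FMA):

```rust
/// SoA distance kernel: `codebook_t` is [dim][num_centroids] flattened
/// row-major. All centroid accumulators advance together; there is no
/// horizontal reduction until the caller reads the results.
fn l2_soa(query: &[f32], codebook_t: &[f32], num_centroids: usize) -> Vec<f32> {
    let mut acc = vec![0.0f32; num_centroids];
    for (d, &q) in query.iter().enumerate() {
        let row = &codebook_t[d * num_centroids..(d + 1) * num_centroids];
        for (a, &c) in acc.iter_mut().zip(row) {
            let diff = q - c; // broadcast scalar minus packed row
            *a += diff * diff; // FMA into running totals
        }
    }
    acc
}
```

With the same two centroids as the AoS example, the transposed codebook `[1, 3, 2, 4]` yields identical distances; only the traversal order changes.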
For typical PQ parameters (256 centroids × 16-dim sub-vectors =
16 KB), the transposed codebook fits in L1 cache, making the
SoA access pattern efficient.
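The footprint arithmetic behind the L1 claim, spelled out (constant names are illustrative):

```rust
/// Codebook footprint for the typical PQ configuration (f32 entries).
const NUM_CENTROIDS: usize = 256;
const SUB_DIM: usize = 16;
const CODEBOOK_BYTES: usize = NUM_CENTROIDS * SUB_DIM * std::mem::size_of::<f32>();
// 256 * 16 * 4 = 16384 bytes = 16 KB, comfortably within a typical
// 32 KB (or larger) L1 data cache.
```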
The transpose is done once when the ProductQuantizer is constructed
and amortized over its lifetime. The primary beneficiary is the
quantization path (transform_impl), which calls nearest() per
sub-vector per vector during index building. The search-time
distance table construction (build_l2_distance_table) also uses
the transposed layout, though it runs only once per query and
is not a bottleneck.
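The one-time transpose itself is a straightforward index swap; a minimal sketch (not the actual `L2Prepared` constructor, function name hypothetical):

```rust
/// One-time AoS -> SoA transpose, done when the quantizer is built.
/// `aos` is [num_centroids][dim] flattened; the result is [dim][num_centroids].
fn transpose_codebook(aos: &[f32], num_centroids: usize, dim: usize) -> Vec<f32> {
    let mut soa = vec![0.0f32; aos.len()];
    for c in 0..num_centroids {
        for d in 0..dim {
            // element (c, d) in AoS lands at (d, c) in SoA
            soa[d * num_centroids + c] = aos[c * dim + d];
        }
    }
    soa
}
```

The O(num_centroids × dim) cost is paid once per `ProductQuantizer` and amortized over every subsequent `nearest()` call.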
Yields a ~70% improvement on the pq_assignment benchmark on my machine.
Can we also do a test on Graviton 4 and GCP's Axion?

