perf: pre-transpose PQ codebook for SIMD-friendly L2 distance #5923

Open
wkalt wants to merge 4 commits into lance-format:main from wkalt:task/prepared-l2-pq-assignment

wkalt (Contributor) commented Feb 10, 2026

During PQ quantization we compute L2 distances from each sub-vector
to every centroid in a codebook. The codebook is stored in AoS
(Array of Structs) layout:

AoS codebook [num_centroids][dimension]:

centroid 0: [d0 d1 d2 d3 ...]
centroid 1: [d0 d1 d2 d3 ...]
centroid 2: [d0 d1 d2 d3 ...]
...

For each centroid:
  diff = query - centroid        (vector subtract)
  dist = sum(diff * diff)        (horizontal reduction)

The horizontal reduction serializes the SIMD pipeline: each
centroid must be fully reduced before starting the next.
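
A minimal sketch of the AoS loop described above, for illustration only. The function name `l2_aos` and the flat `&[f32]` codebook are assumptions, not the code in this PR; distances are squared L2, matching `sum(diff * diff)`.

```rust
/// Hypothetical helper (not from the PR): squared L2 distance from `query`
/// to every centroid of a row-major (AoS) codebook [num_centroids][dim].
fn l2_aos(query: &[f32], codebook: &[f32]) -> Vec<f32> {
    let dim = query.len();
    assert!(dim > 0 && codebook.len() % dim == 0);
    codebook
        .chunks_exact(dim)
        .map(|centroid| {
            // One subtract-and-square pass, then one horizontal sum,
            // repeated separately for every centroid.
            query
                .iter()
                .zip(centroid)
                .map(|(q, c)| (q - c) * (q - c))
                .sum::<f32>()
        })
        .collect()
}
```

Each call to `sum` here is the per-centroid horizontal reduction that the description identifies as the bottleneck.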

This patch introduces L2Prepared, which transposes the codebook once
at construction time to SoA (Structure of Arrays) layout:

SoA codebook [dimension][num_centroids]:

dim 0: [c0 c1 c2 c3 ...]    <- contiguous across centroids
dim 1: [c0 c1 c2 c3 ...]
dim 2: [c0 c1 c2 c3 ...]
...

For each dimension d:
  diff = query[d] - row[d]     (broadcast scalar - packed vector)
  result += diff * diff         (packed FMA into running totals)

All centroids accumulate in parallel with no horizontal reduction
until the final result is read. LLVM emits vbroadcastss + vsubps +
vfmadd231ps (AVX2) or the equivalent on other targets.
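
The SoA scheme above could be sketched roughly as follows. `TransposedCodebook` is a hypothetical stand-in for L2Prepared, and its fields and method names are assumptions made for illustration. Whether a compiler emits the instructions named above for this particular loop depends on target features and on how the real kernel is written, so the sketch only shows the data layout and accumulation order.

```rust
/// Hypothetical stand-in for the PR's L2Prepared (names are assumptions):
/// a codebook stored transposed as [dim][num_centroids].
struct TransposedCodebook {
    dim: usize,
    num_centroids: usize,
    data: Vec<f32>,
}

impl TransposedCodebook {
    /// Transpose a row-major [num_centroids][dim] codebook once, up front.
    fn new(codebook: &[f32], dim: usize) -> Self {
        assert!(dim > 0 && codebook.len() % dim == 0);
        let num_centroids = codebook.len() / dim;
        let mut data = vec![0.0f32; codebook.len()];
        for c in 0..num_centroids {
            for d in 0..dim {
                data[d * num_centroids + c] = codebook[c * dim + d];
            }
        }
        Self { dim, num_centroids, data }
    }

    /// Squared L2 distance from `query` to every centroid.
    fn distances(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.dim);
        let n = self.num_centroids;
        let mut result = vec![0.0f32; n];
        for (d, &q) in query.iter().enumerate() {
            // Row d holds dimension d of all centroids, contiguously.
            let row = &self.data[d * n..(d + 1) * n];
            for (acc, &c) in result.iter_mut().zip(row) {
                let diff = q - c; // scalar query[d] against a packed row
                *acc += diff * diff; // running total per centroid
            }
        }
        result
    }
}
```

The inner loop touches one accumulator per centroid and never sums across lanes, which is the property the description relies on.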

For typical PQ parameters (256 centroids × 16-dim sub-vectors ×
4 bytes per element = 16 KB), the transposed codebook fits in L1
cache, making the SoA access pattern efficient.
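
The 16 KB figure can be checked with a one-line calculation. The helper below is hypothetical, and the 4-byte element size is an assumption (f32) that is consistent with the stated total.

```rust
/// Hypothetical helper: bytes occupied by one sub-vector codebook.
const fn codebook_bytes(num_centroids: usize, dim: usize, elem_bytes: usize) -> usize {
    num_centroids * dim * elem_bytes
}
```

With 256 centroids, 16 dimensions, and `std::mem::size_of::<f32>()` bytes per element, this gives 16384 bytes, i.e. 16 KB per sub-vector codebook.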

The transpose is done once when the ProductQuantizer is constructed
and amortized over its lifetime. The primary beneficiary is the
quantization path (transform_impl), which calls nearest() per
sub-vector per vector during index building. The search-time
distance table construction (build_l2_distance_table) also uses
the transposed layout, though it runs only once per query and
is not a bottleneck.
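
As a rough illustration of the per-sub-vector assignment step, the sketch below finds the closest centroid in a transposed codebook. `nearest_soa` and its signature are assumptions for illustration and are not the signature of the PR's nearest().

```rust
/// Hypothetical sketch (not the PR's nearest()): index of the centroid
/// closest to `query` in a transposed [dim][num_centroids] codebook.
fn nearest_soa(query: &[f32], transposed: &[f32], num_centroids: usize) -> Option<usize> {
    assert_eq!(transposed.len(), query.len() * num_centroids);
    if num_centroids == 0 {
        return None;
    }
    // Accumulate squared distances for all centroids, one dimension at a time.
    let mut dist = vec![0.0f32; num_centroids];
    for (row, &q) in transposed.chunks_exact(num_centroids).zip(query) {
        for (acc, &c) in dist.iter_mut().zip(row) {
            let diff = q - c;
            *acc += diff * diff;
        }
    }
    // A single reduction at the end: argmin over the finished totals.
    dist.iter()
        .enumerate()
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

During index building a function of this shape would run once per sub-vector of every input vector, which is why the layout of the inner loop matters more there than in the once-per-query distance table.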

wkalt force-pushed the task/prepared-l2-pq-assignment branch from 2ba48ef to a3e0faa on February 10, 2026 00:28

wkalt (Contributor, Author) commented Feb 10, 2026

This yields a ~70% improvement on the l2 case of the pq_assignment benchmark on my machine (the dot case is unchanged, within noise):

[~/work/sophon/src/lance] (task/prepared-l2-pq-assignment) $ cargo bench -p lance-index --bench pq_assignment
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
warning: lance-linalg@3.0.0-beta.2: fp16kernels feature is not enabled, skipping build of fp16 kernels
   Compiling lance-linalg v3.0.0-beta.2 (/home/wyatt/work/sophon/src/lance/rust/lance-linalg)
   Compiling lance-index v3.0.0-beta.2 (/home/wyatt/work/sophon/src/lance/rust/lance-index)
    Finished `bench` profile [optimized + debuginfo] target(s) in 25.99s
     Running benches/pq_assignment.rs (target/release/deps/pq_assignment-d246ac7a7d47f7f5)
Gnuplot not found, using plotters backend
Benchmarking l2,32768: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 9.7s.
l2,32768                time:   [964.15 ms 964.83 ms 965.66 ms]
                        change: [-69.989% -69.703% -69.386%] (p = 0.00 < 0.10)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking dot,32768: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 35.7s.
dot,32768               time:   [3.5569 s 3.5594 s 3.5618 s]
                        change: [+0.0341% +0.1304% +0.2110%] (p = 0.01 < 0.10)
                        Change within noise threshold.

wkalt (Contributor, Author) commented Feb 10, 2026

[Image: progress]

here is a picture of the speedup in shuffling during IVF-PQ index build. I think the difference in IVF training memory is a sampling artifact.

edit: the baseline here is another branch with a stack of optimizations excluding this one, not main.

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 96.49123% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-linalg/src/distance/l2.rs | 94.87% | 6 Missing ⚠️ |
| rust/lance-index/src/vector/pq.rs | 98.03% | 2 Missing ⚠️ |


wkalt (Contributor, Author) commented Feb 10, 2026

the last commit is an additional marginal improvement
[Image: progress]

eddyxu (Member) commented Feb 24, 2026

Can we also run a test on Graviton 4 and GCP's Axion?
