perf: pre-transpose PQ codebook for SIMD-friendly L2 distance #5923
Open
wkalt wants to merge 4 commits into lance-format:main
Conversation
During PQ quantization we compute L2 distances from each sub-vector
to every centroid in a codebook. The codebook is stored in AoS
(Array of Structs) layout:
```
AoS codebook [num_centroids][dimension]:
    centroid 0: [d0 d1 d2 d3 ...]
    centroid 1: [d0 d1 d2 d3 ...]
    centroid 2: [d0 d1 d2 d3 ...]
    ...
```

For each centroid:

```
diff = query - centroid    (vector subtract)
dist = sum(diff * diff)    (horizontal reduction)
```
The horizontal reduction serializes the SIMD pipeline: each
centroid must be fully reduced before starting the next.
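The per-centroid pattern above can be sketched in scalar Rust (a minimal illustrative sketch, not the actual lance kernel; the function name is hypothetical):

```rust
/// Baseline AoS distance kernel: for each centroid, subtract element-wise,
/// square, then horizontally reduce to one scalar before moving on.
/// `codebook` is [num_centroids][dim] flattened row-major.
fn l2_aos(query: &[f32], codebook: &[f32], dim: usize) -> Vec<f32> {
    codebook
        .chunks_exact(dim) // one centroid per chunk
        .map(|centroid| {
            query
                .iter()
                .zip(centroid)
                .map(|(q, c)| (q - c) * (q - c))
                .sum::<f32>() // horizontal reduction: serializes the pipeline
        })
        .collect()
}
```

The `sum` at the end is the bottleneck the patch targets: each centroid's result must be fully reduced before the next one starts.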
This patch introduces L2Prepared, which transposes the codebook once
at construction time to SoA (Structure of Arrays) layout:
```
SoA codebook [dimension][num_centroids]:
    dim 0: [c0 c1 c2 c3 ...]   <- contiguous across centroids
    dim 1: [c0 c1 c2 c3 ...]
    dim 2: [c0 c1 c2 c3 ...]
    ...
```

For each dimension d:

```
diff = query[d] - row[d]    (broadcast scalar - packed vector)
result += diff * diff       (packed FMA into running totals)
```
All centroids accumulate in parallel with no horizontal reduction
until the final result is read. LLVM emits vbroadcastss + vsubps +
vfmadd231ps (AVX2) or the equivalent on other targets.
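The SoA loop can be sketched as follows (again an illustrative scalar sketch, not the `L2Prepared` implementation; the autovectorizer turns the inner loop into broadcast + packed FMA):

```rust
/// SoA distance kernel: `codebook_t` is [dim][num_centroids] flattened
/// row-major. All centroid accumulators advance together; there is no
/// horizontal reduction until the caller reads the results.
fn l2_soa(query: &[f32], codebook_t: &[f32], num_centroids: usize) -> Vec<f32> {
    let mut acc = vec![0.0f32; num_centroids];
    for (d, &q) in query.iter().enumerate() {
        let row = &codebook_t[d * num_centroids..(d + 1) * num_centroids];
        for (a, &c) in acc.iter_mut().zip(row) {
            let diff = q - c; // broadcast scalar minus packed row
            *a += diff * diff; // FMA into running totals
        }
    }
    acc
}
```

With the same two centroids as the AoS example, the transposed codebook `[1, 3, 2, 4]` yields identical distances; only the traversal order changes.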
For typical PQ parameters (256 centroids × 16-dim sub-vectors =
16 KB), the transposed codebook fits in L1 cache, making the
SoA access pattern efficient.
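The footprint arithmetic behind the L1 claim, spelled out (constant names are illustrative):

```rust
/// Codebook footprint for the typical PQ configuration (f32 entries).
const NUM_CENTROIDS: usize = 256;
const SUB_DIM: usize = 16;
const CODEBOOK_BYTES: usize = NUM_CENTROIDS * SUB_DIM * std::mem::size_of::<f32>();
// 256 * 16 * 4 = 16384 bytes = 16 KB, comfortably within a typical
// 32 KB (or larger) L1 data cache.
```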
The transpose is done once when the ProductQuantizer is constructed
and amortized over its lifetime. The primary beneficiary is the
quantization path (transform_impl), which calls nearest() per
sub-vector per vector during index building. The search-time
distance table construction (build_l2_distance_table) also uses
the transposed layout, though it runs only once per query and
is not a bottleneck.
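The one-time transpose itself is a straightforward index swap; a minimal sketch (not the actual `L2Prepared` constructor, function name hypothetical):

```rust
/// One-time AoS -> SoA transpose, done when the quantizer is built.
/// `aos` is [num_centroids][dim] flattened; the result is [dim][num_centroids].
fn transpose_codebook(aos: &[f32], num_centroids: usize, dim: usize) -> Vec<f32> {
    let mut soa = vec![0.0f32; aos.len()];
    for c in 0..num_centroids {
        for d in 0..dim {
            // element (c, d) in AoS lands at (d, c) in SoA
            soa[d * num_centroids + c] = aos[c * dim + d];
        }
    }
    soa
}
```

The O(num_centroids × dim) cost is paid once per `ProductQuantizer` and amortized over every subsequent `nearest()` call.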
Yields a ~70% improvement on the pq_assignment benchmark on my machine.
Can we also do a test on Graviton 4 and GCP's Axion?

