Product Quantization: Vector Compression for Fast Search


High-dimensional vectors are everywhere in modern AI: word embeddings, image features, learned representations. But storing and searching billions of 768-dimensional float vectors is expensive, both in memory and computation. Product Quantization (PQ) addresses this by compressing vectors while preserving enough information for approximate nearest neighbor search.

The Problem

Consider a database of 1 billion 128-dimensional float vectors. In raw form, that’s:

1,000,000,000 × 128 × 4 bytes = 512 GB

That’s… a lot of RAM. More than most servers have. Which is why we need compression.

Searching this database means computing distances from a query to every vector. Even with SIMD acceleration, brute-force search over 512 GB is impractical for real-time applications.

We need:

  1. Compression: Reduce storage requirements by 10-100×
  2. Fast distance computation: Avoid decompressing during search
  3. Acceptable accuracy: Approximate results are fine if they’re close enough

Product Quantization delivers all three. And it does so with an elegance that, honestly, borders on unfair.

The Core Insight: Decomposition

The key observation is that a D-dimensional vector can be decomposed into M smaller subvectors, each of dimension d = D/M. If we quantize each subvector independently, the total number of possible codes is the product of per-subspace codebook sizes, hence “Product Quantization.”

For a 128-dimensional vector with M=8 subquantizers:

Original: [x₁, x₂, ..., x₁₂₈] (128 floats, 512 bytes)

Split:    [x₁...x₁₆] | [x₁₇...x₃₂] | ... | [x₁₁₃...x₁₂₈]
            sub₀         sub₁              sub₇

Quantize: [c₀, c₁, c₂, c₃, c₄, c₅, c₆, c₇]  (8 bytes)

Each subvector is replaced by the index of its nearest centroid in a learned codebook. With 256 centroids per subspace (8 bits), we achieve 64× compression.
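The decomposition and the compression arithmetic can be sketched in a few lines of Rust. Everything here (`split`, the constant names) is illustrative, not part of any particular library:

```rust
// Illustrative sketch: split a D=128 vector into M=8 subvectors of d=16 each.
const D: usize = 128; // vector dimension
const M: usize = 8;   // number of subquantizers
const SUB: usize = D / M; // d = 16 dims per subvector

/// Borrow the M subvectors out of one vector.
fn split(v: &[f32; D]) -> Vec<&[f32]> {
    v.chunks(SUB).collect()
}

/// Compression ratio: 4 bytes per float originally, 1 byte per subvector after.
const COMPRESSION: usize = (D * 4) / M; // 512 bytes -> 8 bytes = 64x
```

After encoding, each of the M slices is replaced by a single byte, which is where the 64× figure below comes from.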

Training: Learning the Codebooks

Before encoding vectors, we train codebooks by running k-means clustering on each subspace independently. K-means is one of those algorithms that sounds simple until you actually try to make it work reliably. Initialization alone can ruin your day.

  1. Extract subvectors: For each training vector, extract the M subvectors
  2. Cluster per subspace: Run k-means on each subspace to learn k centroids (typically k=256)
  3. Store codebooks: Keep the M × k × d centroid coordinates

Training complexity is O(M × iterations × n × k × d), where n is the number of training vectors. Since each subspace is independent, this can be parallelized across M subspaces.

Example: Training on 1 million 128-dimensional vectors with M=8, k=256, 25 iterations:

  • 8 independent k-means runs
  • Each subspace: 25 × 1,000,000 × 256 distance computations over 16-dimensional subvectors (~100 billion floating-point operations)
  • Total: ~800 billion operations, but each subspace's run is independent and parallelizable
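The per-subspace clustering step can be sketched as a bare-bones Lloyd's k-means. This is a minimal illustration only: the function names are made up, the initialization is naive (real implementations use k-means++ and empty-cluster handling), and there is no convergence check beyond a fixed iteration count:

```rust
// Minimal per-subspace k-means sketch (Lloyd's algorithm). Illustrative only.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Run `iters` Lloyd iterations over one subspace's training subvectors,
/// returning k centroids of dimension d.
fn kmeans(data: &[Vec<f32>], k: usize, iters: usize) -> Vec<Vec<f32>> {
    let d = data[0].len();
    // Naive init: take the first k points as centroids.
    let mut centroids: Vec<Vec<f32>> = data[..k].to_vec();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0f32; d]; k];
        let mut counts = vec![0usize; k];
        // Assignment step: each point goes to its nearest centroid.
        for point in data {
            let (best, _) = centroids
                .iter()
                .enumerate()
                .map(|(i, c)| (i, squared_dist(point, c)))
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap();
            counts[best] += 1;
            for (s, x) in sums[best].iter_mut().zip(point) {
                *s += *x;
            }
        }
        // Update step: move each non-empty centroid to the mean of its points.
        for i in 0..k {
            if counts[i] > 0 {
                for s in sums[i].iter_mut() {
                    *s /= counts[i] as f32;
                }
                centroids[i] = sums[i].clone();
            }
        }
    }
    centroids
}
```

Running this M times, once per subspace, yields the M codebooks; since the runs share nothing, they parallelize trivially.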

Encoding: Vector → Code

Once trained, encoding is straightforward:

For each subspace m in 0..M:
    subvector = vector[m*d : (m+1)*d]
    code[m] = argmin_k ||subvector - centroid[m][k]||²

Each subvector is assigned to its nearest centroid. The output is M integers in range [0, k), stored as bytes when k=256.
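The argmin loop above can be written out as runnable Rust. The names (`encode`, `squared_dist`) and the nested-`Vec` codebook layout (`codebooks[m][k]` is one centroid) are illustrative, not the bb-codec API:

```rust
// Encoding sketch: map each subvector to the index of its nearest centroid.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Encode one vector into M bytes (assumes k <= 256, so indices fit in u8).
fn encode(vector: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<u8> {
    let m = codebooks.len();
    let d = vector.len() / m;
    (0..m)
        .map(|i| {
            let sub = &vector[i * d..(i + 1) * d];
            // argmin over the k centroids of subspace i
            codebooks[i]
                .iter()
                .enumerate()
                .map(|(j, c)| (j, squared_dist(sub, c)))
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap()
                .0 as u8
        })
        .collect()
}
```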

Compression ratio: D floats (4D bytes) → M bytes, a 4D/M× reduction. For D=128, M=8: 64× compression. Let that sink in. Your 512 GB database just became 8 GB. That fits on a single machine.

Decoding: Code → Approximate Vector

Decoding reconstructs an approximate vector by concatenating centroids:

reconstructed[m*d : (m+1)*d] = centroid[m][code[m]]

The reconstruction is lossy: the original vector is approximated by its nearest codebook entries. Quantization error depends on how well the codebooks capture the data distribution.
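Decoding is just table lookups and concatenation. A sketch, using the same illustrative nested-`Vec` codebook layout as above (not the bb-codec API):

```rust
/// Decoding sketch: concatenate the centroids named by each code byte.
/// codebooks[m][k] holds centroid k of subspace m.
fn decode(code: &[u8], codebooks: &[Vec<Vec<f32>>]) -> Vec<f32> {
    code.iter()
        .enumerate()
        .flat_map(|(m, &c)| codebooks[m][c as usize].iter().copied())
        .collect()
}
```

Note that decoding can only ever produce one of k^M distinct vectors; every input in a cell of the product grid reconstructs to the same point.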

Distance Computation

The power of PQ comes from efficient distance computation without decoding. Two approaches exist:

Asymmetric Distance Computation (ADC)

For comparing a raw query to encoded database vectors:

  1. Precompute: For each subspace m, compute distance from query subvector to all k centroids → M×k table entries
  2. Search: For each encoded database vector, sum M table lookups

Precompute table:
    table[m][k] = ||query[m*d:(m+1)*d] - centroid[m][k]||²

Compute distance to code:
    dist = Σ_m table[m][code[m]]

Complexity: O(M×k×d) precomputation + O(M) per distance. For M=8, k=256, d=16: 32,768 operations to build table, then just 8 table lookups per database vector.

This is the standard approach for search: precompute once per query, then scan encoded database with cheap lookups.
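The two ADC steps can be sketched as follows; the function names and the nested-`Vec` layouts are illustrative, not the bb-codec API:

```rust
// ADC sketch: build an M x k table of query-to-centroid distances once per
// query, then score each encoded database vector with M lookups.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// table[m][k] = squared distance from query subvector m to centroid k.
fn build_adc_table(query: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let m = codebooks.len();
    let d = query.len() / m;
    (0..m)
        .map(|i| {
            let sub = &query[i * d..(i + 1) * d];
            codebooks[i].iter().map(|c| squared_dist(sub, c)).collect()
        })
        .collect()
}

/// Approximate squared distance to one encoded vector: just M lookups.
fn adc_distance(table: &[Vec<f32>], code: &[u8]) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m][c as usize])
        .sum()
}
```

Because the inner loop touches only the small table and the code bytes, scanning millions of codes per query stays cheap and cache-friendly.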

Symmetric Distance Computation (SDC)

For comparing encoded vectors to each other:

  1. Precompute (during training): For each subspace, compute pairwise centroid distances → M×k×k table
  2. Search: For each pair of codes, sum M table lookups

Precompute table (once, during training):
    sdc[m][i][j] = ||centroid[m][i] - centroid[m][j]||²

Compute distance between codes:
    dist = Σ_m sdc[m][code1[m]][code2[m]]

Storage: M×k×k floats. For M=8, k=256: 8×256×256×4 = 2 MB per quantizer.

ADC keeps one foot in reality: raw query, compressed database. SDC jumps in with both feet, compressing everything. SDC is useful when both vectors are already encoded, for example when a ProductQuantizer itself serves as an EmbeddingSpace where the “embeddings” are PQ codes.
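A sketch of the SDC side, with the same caveats as before (illustrative names and layouts, not the bb-codec API):

```rust
// SDC sketch: precompute pairwise centroid distances per subspace once,
// then compare any two codes with M lookups.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// sdc[m][i][j] = squared distance between centroids i and j of subspace m.
fn build_sdc_table(codebooks: &[Vec<Vec<f32>>]) -> Vec<Vec<Vec<f32>>> {
    codebooks
        .iter()
        .map(|cb| {
            cb.iter()
                .map(|ci| cb.iter().map(|cj| squared_dist(ci, cj)).collect())
                .collect()
        })
        .collect()
}

/// Approximate squared distance between two encoded vectors.
fn sdc_distance(sdc: &[Vec<Vec<f32>>], a: &[u8], b: &[u8]) -> f32 {
    a.iter()
        .zip(b)
        .enumerate()
        .map(|(m, (&i, &j))| sdc[m][i as usize][j as usize])
        .sum()
}
```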

Trade-offs

Compression vs Accuracy

Assuming k=256 (8 bits per subquantizer), bits per dimension is 8M/D and compression is 32 ÷ (bits/dim):

  M        Bits/dim   Compression   Accuracy
  no PQ    32         1×            Exact
  D/4      2          16×           Very good
  D/8      1          32×           Good
  D/16     0.5        64×           Moderate

More subquantizers (larger M) means finer quantization and better accuracy, but larger codes. There’s no formula for picking the right M. It’s part math, part intuition, part running it and seeing what happens.

Codebook Size

  k        Bits/subspace   Training cost   Accuracy
  16       4               Low             Coarse
  256      8               Medium          Good
  1024     10              High            Better
  65536    16              Very high       Fine

More centroids capture the distribution better but require more training data (at least k points per subspace) and larger distance tables.

Typical Configuration

For 128-dimensional vectors:

  • M=8 subquantizers (16 dims each)
  • k=256 centroids (8 bits)
  • Code size: 8 bytes
  • Compression: 64×
  • ADC table: 8×256×4 = 8 KB per query

That’s a remarkably capable setup: 64× compression with fast ADC lookups, and it fits neatly into a few kilobytes of scratch space per query. Hard to ask for much more.
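As a back-of-envelope check, the numbers in that configuration all fall out of three constants (the names here are illustrative):

```rust
// Sanity-check the typical configuration's arithmetic.
const D: usize = 128; // vector dimension
const M: usize = 8;   // subquantizers
const K: usize = 256; // centroids per subspace (8-bit codes)

const RAW_BYTES: usize = D * 4;               // f32 vector: 512 bytes
const CODE_BYTES: usize = M;                  // one byte per subquantizer
const COMPRESSION: usize = RAW_BYTES / CODE_BYTES;
const ADC_TABLE_BYTES: usize = M * K * 4;     // f32 entries: the per-query table
```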

When to Use PQ

Good use cases:

  • Memory-constrained environments
  • Large-scale search (millions to billions of vectors)
  • Approximate results are acceptable
  • Query latency matters more than perfect recall

Not ideal for:

  • Small datasets: if your dataset fits in RAM, just use brute force. Seriously. PQ is for when brute force makes you cry.
  • High-precision requirements
  • Very low dimensions (overhead isn’t worth it)

In bytesandbrains

The bb-codec crate provides a ProductQuantizer implementation. See Product Quantization: Rust Implementation Deep Dive for details on the implementation, including:

  • Const-generic PQCode<M, NBITS> for compile-time code layout
  • PQDistanceTable for ADC
  • SDCTable for symmetric distance
  • Integration with the Codec and EmbeddingSpace traits

Further Reading