Product Quantization: Vector Compression for Fast Search


High-dimensional vectors are everywhere in modern AI: word embeddings, image features, learned representations. But storing and searching billions of 768-dimensional float vectors is expensive, both in memory and computation. Product Quantization (PQ) addresses this by compressing vectors while preserving enough information for approximate nearest neighbor search.

The Problem

Consider a database of 1 billion 128-dimensional float vectors. In raw form, that’s:

1,000,000,000 × 128 × 4 bytes = 512 GB

That’s… a lot of RAM. More than most servers have. Which is why we need compression.

Searching this database means computing distances from a query to every vector. Even with SIMD acceleration, brute-force search over 512 GB is impractical for real-time applications.

We need:

  1. Compression: Reduce storage requirements by 10-100×
  2. Fast distance computation: Avoid decompressing during search
  3. Acceptable accuracy: Approximate results are fine if they’re close enough

Product Quantization delivers all three. And it does so with an elegance that, honestly, borders on unfair.

The Core Insight: Decomposition

The key observation is that a D-dimensional vector can be decomposed into M smaller subvectors, each of dimension d = D/M. If we quantize each subvector independently, the total number of possible codes is the product of per-subspace codebook sizes, hence “Product Quantization.”

For a 128-dimensional vector with M=8 subquantizers:

Original: [x₁, x₂, ..., x₁₂₈] (128 floats, 512 bytes)

Split:    [x₁...x₁₆] | [x₁₇...x₃₂] | ... | [x₁₁₃...x₁₂₈]
            sub₀         sub₁              sub₇

Quantize: [c₀, c₁, c₂, c₃, c₄, c₅, c₆, c₇]  (8 bytes)

Each subvector is replaced by the index of its nearest centroid in a learned codebook. With 256 centroids per subspace (8 bits), we achieve 64× compression.
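The decomposition and the compression arithmetic can be sketched in a few lines of Rust. Everything here (`split`, the constant names) is illustrative, not part of any particular library:

```rust
// Illustrative sketch: split a D=128 vector into M=8 subvectors of d=16 each.
const D: usize = 128; // vector dimension
const M: usize = 8;   // number of subquantizers
const SUB: usize = D / M; // d = 16 dims per subvector

/// Borrow the M subvectors out of one vector.
fn split(v: &[f32; D]) -> Vec<&[f32]> {
    v.chunks(SUB).collect()
}

/// Compression ratio: 4 bytes per float originally, 1 byte per subvector after.
const COMPRESSION: usize = (D * 4) / M; // 512 bytes -> 8 bytes = 64x
```

After encoding, each of the M slices is replaced by a single byte, which is where the 64× figure below comes from.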

Training: Learning the Codebooks

Before encoding vectors, we train codebooks by running k-means clustering on each subspace independently. K-means is one of those algorithms that sounds simple until you actually try to make it work reliably. Initialization alone can ruin your day.

  1. Extract subvectors: For each training vector, extract the M subvectors
  2. Cluster per subspace: Run k-means on each subspace to learn k centroids (typically k=256)
  3. Store codebooks: Keep the M × k × d centroid coordinates

Training complexity is O(M × iterations × n × k × d), where n is the number of training vectors. Since each subspace is independent, this can be parallelized across M subspaces.

Example: Training on 1 million 128-dimensional vectors with M=8, k=256, 25 iterations:

  • 8 independent k-means runs
  • Each subspace: 25 × 1,000,000 × 256 distance computations over 16-dimensional subvectors (~100 billion floating-point operations)
  • Total: ~800 billion operations, but each subspace's run is independent and parallelizable
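The per-subspace clustering step can be sketched as a bare-bones Lloyd's k-means. This is a minimal illustration only: the function names are made up, the initialization is naive (real implementations use k-means++ and empty-cluster handling), and there is no convergence check beyond a fixed iteration count:

```rust
// Minimal per-subspace k-means sketch (Lloyd's algorithm). Illustrative only.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Run `iters` Lloyd iterations over one subspace's training subvectors,
/// returning k centroids of dimension d.
fn kmeans(data: &[Vec<f32>], k: usize, iters: usize) -> Vec<Vec<f32>> {
    let d = data[0].len();
    // Naive init: take the first k points as centroids.
    let mut centroids: Vec<Vec<f32>> = data[..k].to_vec();
    for _ in 0..iters {
        let mut sums = vec![vec![0.0f32; d]; k];
        let mut counts = vec![0usize; k];
        // Assignment step: each point goes to its nearest centroid.
        for point in data {
            let (best, _) = centroids
                .iter()
                .enumerate()
                .map(|(i, c)| (i, squared_dist(point, c)))
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap();
            counts[best] += 1;
            for (s, x) in sums[best].iter_mut().zip(point) {
                *s += *x;
            }
        }
        // Update step: move each non-empty centroid to the mean of its points.
        for i in 0..k {
            if counts[i] > 0 {
                for s in sums[i].iter_mut() {
                    *s /= counts[i] as f32;
                }
                centroids[i] = sums[i].clone();
            }
        }
    }
    centroids
}
```

Running this M times, once per subspace, yields the M codebooks; since the runs share nothing, they parallelize trivially.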

Encoding: Vector → Code

Once trained, encoding is straightforward:

For each subspace m in 0..M:
    subvector = vector[m*d : (m+1)*d]
    code[m] = argmin_k ||subvector - centroid[m][k]||²

Each subvector is assigned to its nearest centroid. The output is M integers in range [0, k), stored as bytes when k=256.
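The argmin loop above can be written out as runnable Rust. The names (`encode`, `squared_dist`) and the nested-`Vec` codebook layout (`codebooks[m][k]` is one centroid) are illustrative, not the bb-codec API:

```rust
// Encoding sketch: map each subvector to the index of its nearest centroid.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Encode one vector into M bytes (assumes k <= 256, so indices fit in u8).
fn encode(vector: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<u8> {
    let m = codebooks.len();
    let d = vector.len() / m;
    (0..m)
        .map(|i| {
            let sub = &vector[i * d..(i + 1) * d];
            // argmin over the k centroids of subspace i
            codebooks[i]
                .iter()
                .enumerate()
                .map(|(j, c)| (j, squared_dist(sub, c)))
                .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .unwrap()
                .0 as u8
        })
        .collect()
}
```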

Compression ratio: D floats (4D bytes) → M bytes, a 4D/M× reduction. For D=128, M=8: 64× compression. Let that sink in. Your 512 GB database just became 8 GB. That fits on a single machine.

Decoding: Code → Approximate Vector

Decoding reconstructs an approximate vector by concatenating centroids:

reconstructed[m*d : (m+1)*d] = centroid[m][code[m]]

The reconstruction is lossy: the original vector is approximated by its nearest codebook entries. Quantization error depends on how well the codebooks capture the data distribution.
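Decoding is just table lookups and concatenation. A sketch, using the same illustrative nested-`Vec` codebook layout as above (not the bb-codec API):

```rust
/// Decoding sketch: concatenate the centroids named by each code byte.
/// codebooks[m][k] holds centroid k of subspace m.
fn decode(code: &[u8], codebooks: &[Vec<Vec<f32>>]) -> Vec<f32> {
    code.iter()
        .enumerate()
        .flat_map(|(m, &c)| codebooks[m][c as usize].iter().copied())
        .collect()
}
```

Note that decoding can only ever produce one of k^M distinct vectors; every input in a cell of the product grid reconstructs to the same point.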

Distance Computation

The power of PQ comes from efficient distance computation without decoding. Two approaches exist:

Asymmetric Distance Computation (ADC)

For comparing a raw query to encoded database vectors:

  1. Precompute: For each subspace m, compute distance from query subvector to all k centroids → M×k table entries
  2. Search: For each encoded database vector, sum M table lookups

Precompute table:
    table[m][k] = ||query[m*d:(m+1)*d] - centroid[m][k]||²

Compute distance to code:
    dist = Σ_m table[m][code[m]]

Complexity: O(M×k×d) precomputation + O(M) per distance. For M=8, k=256, d=16: 32,768 operations to build table, then just 8 table lookups per database vector.

This is the standard approach for search: precompute once per query, then scan encoded database with cheap lookups.
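The two ADC steps can be sketched as follows; the function names and the nested-`Vec` layouts are illustrative, not the bb-codec API:

```rust
// ADC sketch: build an M x k table of query-to-centroid distances once per
// query, then score each encoded database vector with M lookups.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// table[m][k] = squared distance from query subvector m to centroid k.
fn build_adc_table(query: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let m = codebooks.len();
    let d = query.len() / m;
    (0..m)
        .map(|i| {
            let sub = &query[i * d..(i + 1) * d];
            codebooks[i].iter().map(|c| squared_dist(sub, c)).collect()
        })
        .collect()
}

/// Approximate squared distance to one encoded vector: just M lookups.
fn adc_distance(table: &[Vec<f32>], code: &[u8]) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m][c as usize])
        .sum()
}
```

Because the inner loop touches only the small table and the code bytes, scanning millions of codes per query stays cheap and cache-friendly.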

Symmetric Distance Computation (SDC)

For comparing encoded vectors to each other:

  1. Precompute (during training): For each subspace, compute pairwise centroid distances → M×k×k table
  2. Search: For each pair of codes, sum M table lookups

Precompute table (once, during training):
    sdc[m][i][j] = ||centroid[m][i] - centroid[m][j]||²

Compute distance between codes:
    dist = Σ_m sdc[m][code1[m]][code2[m]]

Storage: M×k×k floats. For M=8, k=256: 8×256×256×4 = 2 MB per quantizer.

ADC keeps one foot in reality: raw query, compressed database. SDC jumps in with both feet, compressing everything. SDC is useful when both vectors are already encoded, for example when a ProductQuantizer itself serves as an EmbeddingSpace where the “embeddings” are PQ codes.
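A sketch of the SDC side, with the same caveats as before (illustrative names and layouts, not the bb-codec API):

```rust
// SDC sketch: precompute pairwise centroid distances per subspace once,
// then compare any two codes with M lookups.

fn squared_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// sdc[m][i][j] = squared distance between centroids i and j of subspace m.
fn build_sdc_table(codebooks: &[Vec<Vec<f32>>]) -> Vec<Vec<Vec<f32>>> {
    codebooks
        .iter()
        .map(|cb| {
            cb.iter()
                .map(|ci| cb.iter().map(|cj| squared_dist(ci, cj)).collect())
                .collect()
        })
        .collect()
}

/// Approximate squared distance between two encoded vectors.
fn sdc_distance(sdc: &[Vec<Vec<f32>>], a: &[u8], b: &[u8]) -> f32 {
    a.iter()
        .zip(b)
        .enumerate()
        .map(|(m, (&i, &j))| sdc[m][i as usize][j as usize])
        .sum()
}
```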

Trade-offs

Compression vs Accuracy

Assuming k=256 (8 bits per subquantizer), bits per dimension is 8M/D and compression is 32 ÷ (bits/dim):

  M        Bits/dim   Compression   Accuracy
  no PQ    32         1×            Exact
  D/4      2          16×           Very good
  D/8      1          32×           Good
  D/16     0.5        64×           Moderate

More subquantizers (larger M) means finer quantization and better accuracy, but larger codes. There’s no formula for picking the right M. It’s part math, part intuition, part running it and seeing what happens.

Codebook Size

  k        Bits/subspace   Training cost   Accuracy
  16       4               Low             Coarse
  256      8               Medium          Good
  1024     10              High            Better
  65536    16              Very high       Fine

More centroids capture the distribution better but require more training data (at least k points per subspace) and larger distance tables.

Typical Configuration

For 128-dimensional vectors:

  • M=8 subquantizers (16 dims each)
  • k=256 centroids (8 bits)
  • Code size: 8 bytes
  • Compression: 64×
  • ADC table: 8×256×4 = 8 KB per query

That’s a remarkably capable setup: 64× compression with fast ADC lookups, and it fits neatly into a few kilobytes of scratch space per query. Hard to ask for much more.
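As a back-of-envelope check, the numbers in that configuration all fall out of three constants (the names here are illustrative):

```rust
// Sanity-check the typical configuration's arithmetic.
const D: usize = 128; // vector dimension
const M: usize = 8;   // subquantizers
const K: usize = 256; // centroids per subspace (8-bit codes)

const RAW_BYTES: usize = D * 4;               // f32 vector: 512 bytes
const CODE_BYTES: usize = M;                  // one byte per subquantizer
const COMPRESSION: usize = RAW_BYTES / CODE_BYTES;
const ADC_TABLE_BYTES: usize = M * K * 4;     // f32 entries: the per-query table
```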

When to Use PQ

Good use cases:

  • Memory-constrained environments
  • Large-scale search (millions to billions of vectors)
  • Approximate results are acceptable
  • Query latency matters more than perfect recall

Not ideal for:

  • Small datasets: if your dataset fits in RAM, just use brute force. Seriously. PQ is for when brute force makes you cry.
  • High-precision requirements
  • Very low dimensions (overhead isn’t worth it)

In bytesandbrains

The bb-codec crate provides a ProductQuantizer implementation. See Product Quantization: Rust Implementation Deep Dive for details on the implementation, including:

  • Const-generic PQCode<M, NBITS> for compile-time code layout
  • PQDistanceTable for ADC
  • SDCTable for symmetric distance
  • Integration with the Codec and EmbeddingSpace traits

Further Reading