Product Quantization: Rust Implementation Deep Dive
The theory of Product Quantization is elegant. Implementing it with Rust’s const generics is… also elegant, once you stop fighting the compiler. Here’s how we did it.
This post walks through the Product Quantization implementation in bb-codec. If you’re new to PQ, start with Product Quantization: Vector Compression for Fast Search for the theory.
Architecture Overview
The bb-codec crate is organized around the ProductQuantizer type:
bb-codec/src/
├── lib.rs # Re-exports
└── pq/
├── mod.rs # ProductQuantizer
├── code.rs # PQCode<M, NBITS>
├── distance.rs # PQDistanceTable (ADC)
└── sdc.rs # SDCTable (symmetric distance)
The key design decisions:
- Const generics encode M and NBITS at compile time, so the compiler catches “oops I used 8 subquantizers with a codebook trained for 16” before you even run the code
- No heap allocation for codes: fixed-size arrays
- Codec trait integration for pluggable compression
- EmbeddingSpace implementation for direct use as a metric space
PQCode: The Compressed Representation
PQCode<M, NBITS> stores M centroid indices, each using NBITS bits (byte-aligned):
pub struct PQCode<const M: usize, const NBITS: usize>
where
[(); bytes_for_nbits(NBITS)]:,
{
pub codes: [[u8; bytes_for_nbits(NBITS)]; M],
}
The where clause looks like line noise at first, but it earns its keep. bytes_for_nbits(NBITS) computes ceil(NBITS / 8) at compile time, and the bound [(); bytes_for_nbits(NBITS)]: (which requires the unstable generic_const_exprs feature) lets that const expression appear as an array length inside the type. Every quantizer configuration is checked when the type is instantiated, so an invalid one is a compile error, not a runtime surprise.
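A minimal sketch of what bytes_for_nbits might look like (the real helper lives in bb-codec; this reconstruction just rounds bit widths up to whole bytes and is evaluable in array-length position):

```rust
// Hypothetical reconstruction of bb-codec's helper: ceil(nbits / 8)
// as a const fn, so it can be used in array lengths at compile time.
const fn bytes_for_nbits(nbits: usize) -> usize {
    (nbits + 7) / 8
}
```

With this, NBITS=8 codes cost one byte per subquantizer and NBITS=10 codes cost two, matching the sizes quoted below.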
Why this design?
- Fixed size: PQCode<8, 8> is exactly 8 bytes, PQCode<16, 10> is 32 bytes, no heap allocation
- Type safety: PQCode<8, 8> and PQCode<16, 8> are different types, so dimension mismatches are compile errors
- Flexible bit widths: NBITS=10 gives 1024 centroids without wasting space (2 bytes vs 4)
Accessing Indices
Getting and setting indices handles the byte-level encoding:
impl<const M: usize, const NBITS: usize> PQCode<M, NBITS> {
pub fn get(&self, m: usize) -> u32 {
let mut value = 0u32;
for (i, &byte) in self.codes[m].iter().enumerate() {
value |= (byte as u32) << (i * 8);
}
value
}
pub fn set(&mut self, m: usize, value: u32) {
for i in 0..Self::BYTES_PER_CODE {
self.codes[m][i] = ((value >> (i * 8)) & 0xFF) as u8;
}
}
}
The storage is little-endian: for NBITS=10 (1024 centroids), index 500 is stored as [0xF4, 0x01].
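A standalone toy version of that byte-level packing makes the little-endian claim easy to check (these helpers are illustrative, not the bb-codec API):

```rust
// Pack a centroid index into `nbytes` little-endian bytes,
// mirroring PQCode::set.
fn pack_le(value: u32, nbytes: usize) -> Vec<u8> {
    (0..nbytes).map(|i| ((value >> (i * 8)) & 0xFF) as u8).collect()
}

// Reassemble the index, mirroring PQCode::get.
fn unpack_le(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .enumerate()
        .fold(0u32, |acc, (i, &b)| acc | ((b as u32) << (i * 8)))
}
```

Index 500 is 0x01F4, so the low byte 0xF4 comes first.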
Embedding Trait Implementation
PQCode implements Embedding for compatibility with Index and other abstractions:
impl<const M: usize, const NBITS: usize> Embedding for PQCode<M, NBITS> {
type Scalar = u8;
fn length() -> usize {
Self::TOTAL_BYTES
}
fn as_slice(&self) -> &[Self::Scalar] {
// Safe: [[u8; B]; M] has same layout as [u8; M*B]
unsafe {
std::slice::from_raw_parts(
self.codes.as_ptr() as *const u8,
Self::TOTAL_BYTES
)
}
}
}
Yes, there’s unsafe in here. The cast is sound because Rust guarantees that arrays are laid out contiguously with no padding, so [[u8; B]; M] and [u8; M * B] really do have identical layouts, and the slice length is pinned to Self::TOTAL_BYTES so the view can never run past the end of the codes.
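The layout assumption is easy to sanity-check on a toy size; this sketch performs the same cast on a [[u8; 2]; 3] and compares it against the flat bytes:

```rust
// Same cast the Embedding impl performs, on a fixed toy shape:
// [[u8; 2]; 3] is 6 contiguous bytes with no padding.
fn flatten_codes(codes: &[[u8; 2]; 3]) -> &[u8] {
    unsafe { std::slice::from_raw_parts(codes.as_ptr() as *const u8, 6) }
}
```

size_of confirms there is no hidden padding, so the flattened view sees exactly the stored bytes in order.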
ProductQuantizer
ProductQuantizer manages codebooks and provides encode/decode operations:
pub struct ProductQuantizer<S: EmbeddingSpace, const M: usize, const NBITS: usize>
where
[(); bytes_for_nbits(NBITS)]:,
{
space: S,
dsub: usize, // Dimension per subspace
d: usize, // Total dimension
centroids: Arc<Vec<f32>>, // M * KSUB * dsub centroids, Arc for cheap clone
sdc_table: Option<Arc<SDCTable<M, NBITS>>>,
trained: bool,
next_op_id: u64,
subvec_buffer: Vec<f32>, // Reusable buffer to avoid allocation
}
The space is owned (not borrowed), so a ProductQuantizer has no lifetime parameter and can be stored anywhere. Centroids and the SDC table are wrapped in Arc so cloning a trained quantizer across actors is effectively a pointer copy.
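The cheap-clone point is easy to see in isolation: cloning an Arc copies a pointer and bumps a reference count, and the underlying centroid buffer is never duplicated (toy sketch, not the bb-codec types):

```rust
use std::sync::Arc;

// Hand a shared handle to another actor: the Vec<f32> behind the
// Arc is not copied, only the refcount changes.
fn cheap_clone(centroids: &Arc<Vec<f32>>) -> Arc<Vec<f32>> {
    Arc::clone(centroids)
}
```

Arc::ptr_eq confirms both handles point at the same allocation.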
Training
Training runs k-means++ on each subspace independently:
pub fn train_on(&mut self, data: &[S::EmbeddingData]) {
let ksub = Self::KSUB;
// Allocate flat centroid storage: M * ksub * dsub
let mut centroids = vec![0.0; M * ksub * self.dsub];
// Train each subspace independently
for subspace in 0..M {
// Extract subvectors
let subvectors: Vec<Vec<f32>> = data
.iter()
.map(|emb| {
let slice = emb.as_slice();
let start = subspace * self.dsub;
slice[start..start + self.dsub]
.iter()
.map(|&s| s.into())
.collect()
})
.collect();
// Run k-means++
let kmeans = KMeans::fit(&subvectors, ksub, 25);
// Copy centroids to flat storage
for (k, centroid) in kmeans.centroids.iter().enumerate() {
let offset = (subspace * ksub + k) * self.dsub;
centroids[offset..offset + self.dsub]
.copy_from_slice(centroid);
}
}
// Build SDC table using the embedding space's distance metric.
let sdc = SDCTable::from_centroids_with_distance(
&centroids,
self.dsub,
S::slice_distance,
);
self.centroids = Arc::new(centroids);
self.sdc_table = Some(Arc::new(sdc));
self.trained = true;
}
The centroids are stored flat for cache-friendly access during encoding.
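The flat layout means centroid k of subspace m starts at a fixed offset, which is the indexing used throughout the encode and table-build loops:

```rust
// Offset of centroid k in subspace m within the flat centroid
// buffer of M * ksub centroids, dsub floats each.
fn centroid_offset(m: usize, k: usize, ksub: usize, dsub: usize) -> usize {
    (m * ksub + k) * dsub
}
```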
Encoding
Encoding assigns each subvector to its nearest centroid:
pub fn encode_embedding(&mut self, embedding: &S::EmbeddingData) -> PQCode<M, NBITS> {
let slice = embedding.as_slice();
let mut code = PQCode::<M, NBITS>::zeros();
for subspace in 0..M {
self.fill_subvec_buffer(slice, subspace);
let nearest = self.find_nearest_centroid(subspace);
code.set(subspace, nearest as u32);
}
code
}
The subvec_buffer is reused across calls to avoid per-encode allocation. find_nearest_centroid does a brute-force scan over all k centroids in the subspace.
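find_nearest_centroid isn’t shown above; a minimal standalone version of the brute-force scan might look like this (squared L2 for concreteness, operating on one subspace of the flat centroid buffer):

```rust
// Hypothetical sketch: scan all ksub centroids of one subspace
// (stored flat, dsub floats each) for the one nearest to `sub`.
fn find_nearest_centroid(sub: &[f32], centroids: &[f32], ksub: usize, dsub: usize) -> usize {
    let mut best = 0;
    let mut best_dist = f32::INFINITY;
    for k in 0..ksub {
        let c = &centroids[k * dsub..(k + 1) * dsub];
        let dist: f32 = sub.iter().zip(c).map(|(x, y)| (x - y) * (x - y)).sum();
        if dist < best_dist {
            best_dist = dist;
            best = k;
        }
    }
    best
}
```

For ksub = 256 this is 256 short distance computations per subspace, which is why encoding is the expensive direction and decoding is nearly free.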
PQDistanceTable: ADC
Asymmetric Distance Computation precomputes query-to-centroid distances:
pub struct PQDistanceTable<S: EmbeddingSpace, const M: usize, const NBITS: usize> {
table: Vec<S::DistanceValue>, // M * KSUB entries
ksub: usize,
}
Building the table:
pub fn build_distance_table(
&mut self,
query: &S::EmbeddingData,
) -> PQDistanceTable<S, M, NBITS> {
let ksub = Self::KSUB;
let query_slice = query.as_slice();
let mut table = Vec::with_capacity(M * ksub);
for subspace in 0..M {
self.fill_subvec_buffer(query_slice, subspace);
for k in 0..ksub {
let centroid_offset = (subspace * ksub + k) * self.dsub;
let centroid = &self.centroids[centroid_offset..centroid_offset + self.dsub];
// Delegate to the embedding space's metric so codecs
// inherit whatever distance the space defines.
let dist = S::slice_distance(&self.subvec_buffer, centroid);
table.push(S::DistanceValue::from(dist));
}
}
PQDistanceTable::new(table, ksub)
}
Once the table is built, scanning encoded vectors requires no per-dimension floating-point math: each candidate costs M table lookups and M additions. This is why PQ is fast, and why the const-generic design matters: the table dimensions are known at compile time. Computing the distance to an encoded vector looks like this:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize>
PQDistanceTable<S, M, NBITS>
{
pub fn distance(&self, code: &PQCode<M, NBITS>) -> S::DistanceValue {
(0..M)
.map(|m| {
let c = code.get(m) as usize;
self.table[m * self.ksub + c]
})
.fold(S::DistanceValue::zero(), |acc, d| acc + d)
}
}
Performance: Building the table is O(k × d) total (M subspaces × k centroids × d/M dimensions each). Each distance computation is O(M): typically 8 lookups and additions for M=8.
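To make the lookup structure concrete, here is a toy ADC pass with M = 2 and ksub = 2, using a prebuilt table of illustrative distances (plain slices, not the bb-codec types):

```rust
// Toy ADC: table[m * ksub + k] holds the precomputed query-to-centroid
// distance for centroid k of subspace m. A code's distance is the sum
// of M lookups.
fn adc_distance(table: &[f32], code: &[usize], ksub: usize) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m * ksub + c])
        .sum()
}
```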
SDCTable: Symmetric Distance
When comparing two codes (not a raw query), we use precomputed centroid-to-centroid distances:
pub struct SDCTable<const M: usize, const NBITS: usize> {
table: Vec<f32>, // M * KSUB * KSUB entries
ksub: usize,
}
Built once during training. The squared-L2 constructor is a thin wrapper around a generic builder that accepts any subspace distance function, which keeps the loop reusable for alternative metrics:
impl<const M: usize, const NBITS: usize> SDCTable<M, NBITS> {
pub fn from_centroids(centroids: &[f32], dsub: usize) -> Self {
Self::from_centroids_with_distance(centroids, dsub, |a, b| {
a.iter()
.zip(b.iter())
.map(|(x, y)| {
let diff = x - y;
diff * diff
})
.sum()
})
}
pub fn from_centroids_with_distance(
centroids: &[f32],
dsub: usize,
subspace_distance: impl Fn(&[f32], &[f32]) -> f32,
) -> Self {
let ksub = Self::KSUB;
let mut table = vec![0.0f32; M * ksub * ksub];
for m in 0..M {
for i in 0..ksub {
let ci_offset = (m * ksub + i) * dsub;
let ci = &centroids[ci_offset..ci_offset + dsub];
for j in 0..ksub {
let cj_offset = (m * ksub + j) * dsub;
let cj = &centroids[cj_offset..cj_offset + dsub];
table[m * ksub * ksub + i * ksub + j] = subspace_distance(ci, cj);
}
}
}
Self { table, ksub }
}
}
Distance between two codes:
pub fn distance(&self, code1: &PQCode<M, NBITS>, code2: &PQCode<M, NBITS>) -> f32 {
let mut sum = 0.0f32;
for m in 0..M {
let i = code1.get(m) as usize;
let j = code2.get(m) as usize;
sum += self.table[m * self.ksub * self.ksub + i * self.ksub + j];
}
sum
}
Storage: M × k² × 4 bytes. For M=8, k=256: 2 MB.
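The storage arithmetic, spelled out:

```rust
// SDC table size: M subspaces × ksub² centroid pairs × 4 bytes per f32.
fn sdc_table_bytes(m: usize, ksub: usize) -> usize {
    m * ksub * ksub * std::mem::size_of::<f32>()
}
```

This grows quadratically in ksub, which is why SDC tables for NBITS much above 8 start to hurt.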
Codec Trait Integration
ProductQuantizer implements the Codec trait from bb-core:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize> Codec<S>
for ProductQuantizer<S, M, NBITS>
{
type Encoded = PQCode<M, NBITS>;
type EncodeRef<'a> = EagerOpRef<PQCode<M, NBITS>, PQError>;
type DecodeRef<'a> = EagerOpRef<S::EmbeddingData, PQError>;
type TrainRef<'a> = EagerOpRef<(), PQError>;
type ObserveRef<'a> = EagerOpRef<(), PQError>;
fn encode(&mut self, embedding: &S::EmbeddingData) -> Self::EncodeRef<'_> {
if !self.trained {
return EagerOpRef::err(OpId(0), PQError::NotTrained);
}
EagerOpRef::ok(OpId(0), self.encode_embedding(embedding))
}
fn decode(&self, encoded: &Self::Encoded) -> Self::DecodeRef<'_> {
if !self.trained {
return EagerOpRef::err(OpId(0), PQError::NotTrained);
}
EagerOpRef::ok(OpId(0), self.decode_code(encoded))
}
fn train(&mut self, embeddings: &[S::EmbeddingData]) -> Self::TrainRef<'_> {
self.train_on(embeddings);
EagerOpRef::ok(self.alloc_op_id(), ())
}
fn is_trained(&self) -> bool {
self.trained
}
// `encode_batch`, `decode_batch`, `observe`, `observe_batch`, `code_size` elided.
}
EagerOpRef is a synchronous OpRef implementation for local operations. The operation completes immediately, but the interface matches async network codecs.
EmbeddingSpace Implementation
ProductQuantizer also implements EmbeddingSpace directly for PQCode<M, NBITS>:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize> EmbeddingSpace
for ProductQuantizer<S, M, NBITS>
{
type EmbeddingData = PQCode<M, NBITS>;
type DistanceValue = S::DistanceValue;
fn distance(&self, lhs: &Self::EmbeddingData, rhs: &Self::EmbeddingData) -> Self::DistanceValue {
let sdc = self.sdc_table.as_ref()
.expect("must be trained");
S::DistanceValue::from(sdc.distance(lhs, rhs))
}
}
This enables using a trained ProductQuantizer as the embedding space for a FlatIndex:
// Index stores PQ codes directly
let mut pq = ProductQuantizer::<_, 8, 8>::new(space);
pq.train_on(&training_data);
let mut index = FlatIndex::new(&pq);
for vec in data {
let code = pq.encode_embedding(&vec);
index.add(&code, value, &());
}
// Search with codes
let query_code = pq.encode_embedding(&query);
let neighbors = index.search_knn(&query_code, 10);
Usage Example
use bb_codec::{ProductQuantizer, PQCode};
use bb_core::{F32L2Space, F32Embedding};
// Create quantizer for 128-dim vectors, 8 subspaces, 256 centroids
let space = F32L2Space::<128>;
let mut pq = ProductQuantizer::<_, 8, 8>::new(space);
// Train on representative data
pq.train_on(&training_vectors);
// Encode vectors
let code: PQCode<8, 8> = pq.encode_embedding(&vector);
// code is 8 bytes vs 512 bytes original
// ADC: search with raw query
let table = pq.build_distance_table(&query);
for code in encoded_database {
let dist = table.distance(&code); // 8 table lookups
}
// SDC: compare codes directly
let sdc = pq.sdc_table().unwrap();
let dist = sdc.distance(&code1, &code2);
// Decode to approximate vector
let reconstructed = pq.decode_code(&code);
PQ compression is one of those building blocks that quietly makes everything else possible. Without it, high-dimensional embeddings in peer-to-peer overlay networks would be too expensive to transmit.
Further Reading
- Product Quantization: Vector Compression for Fast Search: the theory behind PQ
- K-Means Clustering: how codebooks are trained
- The bytesandbrains Library: library overview