Product Quantization: Rust Implementation Deep Dive
The theory of Product Quantization is elegant. Implementing it with Rust’s const generics is… also elegant, once you stop fighting the compiler. Here’s how we did it.
This post walks through the Product Quantization implementation in bb-codec. If you’re new to PQ, start with Product Quantization: Vector Compression for Fast Search for the theory.
Architecture Overview
The bb-codec crate is organized around the ProductQuantizer type:
bb-codec/src/
├── lib.rs # Re-exports
└── pq/
├── mod.rs # ProductQuantizer
├── code.rs # PQCode<M, NBITS>
├── distance.rs # PQDistanceTable (ADC)
└── sdc.rs # SDCTable (symmetric distance)
The key design decisions:
- Const generics encode M and NBITS at compile time, so the compiler catches “oops I used 8 subquantizers with a codebook trained for 16” before you even run the code
- No heap allocation for codes: fixed-size arrays
- Codec trait integration for pluggable compression
- EmbeddingSpace implementation for direct use as a metric space
PQCode: The Compressed Representation
PQCode<M, NBITS> stores M centroid indices, each using NBITS bits (byte-aligned):
pub struct PQCode<const M: usize, const NBITS: usize>
where
[(); bytes_for_nbits(NBITS)]:,
{
pub codes: [[u8; bytes_for_nbits(NBITS)]; M],
}
The where clause looks like line noise at first, but it earns its keep. bytes_for_nbits(NBITS) computes ceil(NBITS / 8) at compile time, and the bound [(); bytes_for_nbits(NBITS)]: (which requires the unstable generic_const_exprs feature) lets that const expression appear as an array length inside the type. Every quantizer configuration is checked when the type is instantiated, so an invalid one is a compile error, not a runtime surprise.
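A minimal sketch of what bytes_for_nbits might look like (the real helper lives in bb-codec; this reconstruction just rounds bit widths up to whole bytes and is evaluable in array-length position):

```rust
// Hypothetical reconstruction of bb-codec's helper: ceil(nbits / 8)
// as a const fn, so it can be used in array lengths at compile time.
const fn bytes_for_nbits(nbits: usize) -> usize {
    (nbits + 7) / 8
}
```

With this, NBITS=8 codes cost one byte per subquantizer and NBITS=10 codes cost two, matching the sizes quoted below.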
Why this design?
- Fixed size: PQCode<8, 8> is exactly 8 bytes, PQCode<16, 10> is 32 bytes, no heap allocation
- Type safety: PQCode<8, 8> and PQCode<16, 8> are different types, so dimension mismatches are compile errors
- Flexible bit widths: NBITS=10 gives 1024 centroids without wasting space (2 bytes vs 4)
Accessing Indices
Getting and setting indices handles the byte-level encoding:
impl<const M: usize, const NBITS: usize> PQCode<M, NBITS> {
pub fn get(&self, m: usize) -> u32 {
let mut value = 0u32;
for (i, &byte) in self.codes[m].iter().enumerate() {
value |= (byte as u32) << (i * 8);
}
value
}
pub fn set(&mut self, m: usize, value: u32) {
for i in 0..Self::BYTES_PER_CODE {
self.codes[m][i] = ((value >> (i * 8)) & 0xFF) as u8;
}
}
}
The storage is little-endian: for NBITS=10 (1024 centroids), index 500 is stored as [0xF4, 0x01].
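A standalone toy version of that byte-level packing makes the little-endian claim easy to check (these helpers are illustrative, not the bb-codec API):

```rust
// Pack a centroid index into `nbytes` little-endian bytes,
// mirroring PQCode::set.
fn pack_le(value: u32, nbytes: usize) -> Vec<u8> {
    (0..nbytes).map(|i| ((value >> (i * 8)) & 0xFF) as u8).collect()
}

// Reassemble the index, mirroring PQCode::get.
fn unpack_le(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .enumerate()
        .fold(0u32, |acc, (i, &b)| acc | ((b as u32) << (i * 8)))
}
```

Index 500 is 0x01F4, so the low byte 0xF4 comes first.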
Embedding Trait Implementation
PQCode implements Embedding for compatibility with Index and other abstractions:
impl<const M: usize, const NBITS: usize> Embedding for PQCode<M, NBITS> {
type Scalar = u8;
fn length() -> usize {
Self::TOTAL_BYTES
}
fn as_slice(&self) -> &[Self::Scalar] {
// Safe: [[u8; B]; M] has same layout as [u8; M*B]
unsafe {
std::slice::from_raw_parts(
self.codes.as_ptr() as *const u8,
Self::TOTAL_BYTES
)
}
}
}
Yes, there’s unsafe in here. The cast is sound because Rust guarantees that arrays are laid out contiguously with no padding, so [[u8; B]; M] and [u8; M * B] really do have identical layouts, and the slice length is pinned to Self::TOTAL_BYTES so the view can never run past the end of the codes.
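The layout assumption is easy to sanity-check on a toy size; this sketch performs the same cast on a [[u8; 2]; 3] and compares it against the flat bytes:

```rust
// Same cast the Embedding impl performs, on a fixed toy shape:
// [[u8; 2]; 3] is 6 contiguous bytes with no padding.
fn flatten_codes(codes: &[[u8; 2]; 3]) -> &[u8] {
    unsafe { std::slice::from_raw_parts(codes.as_ptr() as *const u8, 6) }
}
```

size_of confirms there is no hidden padding, so the flattened view sees exactly the stored bytes in order.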
ProductQuantizer
ProductQuantizer manages codebooks and provides encode/decode operations:
pub struct ProductQuantizer<S: EmbeddingSpace, const M: usize, const NBITS: usize>
where
[(); bytes_for_nbits(NBITS)]:,
{
space: S,
dsub: usize, // Dimension per subspace
d: usize, // Total dimension
centroids: Arc<Vec<f32>>, // M * KSUB * dsub centroids, Arc for cheap clone
sdc_table: Option<Arc<SDCTable<M, NBITS>>>,
trained: bool,
next_op_id: u64,
subvec_buffer: Vec<f32>, // Reusable buffer to avoid allocation
}
The space is owned (not borrowed), so a ProductQuantizer has no lifetime parameter and can be stored anywhere. Centroids and the SDC table are wrapped in Arc so cloning a trained quantizer across actors is effectively a pointer copy.
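The cheap-clone point is easy to see in isolation: cloning an Arc copies a pointer and bumps a reference count, and the underlying centroid buffer is never duplicated (toy sketch, not the bb-codec types):

```rust
use std::sync::Arc;

// Hand a shared handle to another actor: the Vec<f32> behind the
// Arc is not copied, only the refcount changes.
fn cheap_clone(centroids: &Arc<Vec<f32>>) -> Arc<Vec<f32>> {
    Arc::clone(centroids)
}
```

Arc::ptr_eq confirms both handles point at the same allocation.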
Training
Training runs k-means++ on each subspace independently:
pub fn train_on(&mut self, data: &[S::EmbeddingData]) {
let ksub = Self::KSUB;
// Allocate flat centroid storage: M * ksub * dsub
let mut centroids = vec![0.0; M * ksub * self.dsub];
// Train each subspace independently
for subspace in 0..M {
// Extract subvectors
let subvectors: Vec<Vec<f32>> = data
.iter()
.map(|emb| {
let slice = emb.as_slice();
let start = subspace * self.dsub;
slice[start..start + self.dsub]
.iter()
.map(|&s| s.into())
.collect()
})
.collect();
// Run k-means++
let kmeans = KMeans::fit(&subvectors, ksub, 25);
// Copy centroids to flat storage
for (k, centroid) in kmeans.centroids.iter().enumerate() {
let offset = (subspace * ksub + k) * self.dsub;
centroids[offset..offset + self.dsub]
.copy_from_slice(centroid);
}
}
// Build SDC table using the embedding space's distance metric.
let sdc = SDCTable::from_centroids_with_distance(
&centroids,
self.dsub,
S::slice_distance,
);
self.centroids = Arc::new(centroids);
self.sdc_table = Some(Arc::new(sdc));
self.trained = true;
}
The centroids are stored flat for cache-friendly access during encoding.
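The flat layout means centroid k of subspace m starts at a fixed offset, which is the indexing used throughout the encode and table-build loops:

```rust
// Offset of centroid k in subspace m within the flat centroid
// buffer of M * ksub centroids, dsub floats each.
fn centroid_offset(m: usize, k: usize, ksub: usize, dsub: usize) -> usize {
    (m * ksub + k) * dsub
}
```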
Encoding
Encoding assigns each subvector to its nearest centroid:
pub fn encode_embedding(&mut self, embedding: &S::EmbeddingData) -> PQCode<M, NBITS> {
let slice = embedding.as_slice();
let mut code = PQCode::<M, NBITS>::zeros();
for subspace in 0..M {
self.fill_subvec_buffer(slice, subspace);
let nearest = self.find_nearest_centroid(subspace);
code.set(subspace, nearest as u32);
}
code
}
The subvec_buffer is reused across calls to avoid per-encode allocation. find_nearest_centroid does a brute-force scan over all k centroids in the subspace.
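find_nearest_centroid isn’t shown above; a minimal standalone version of the brute-force scan might look like this (squared L2 for concreteness, operating on one subspace of the flat centroid buffer):

```rust
// Hypothetical sketch: scan all ksub centroids of one subspace
// (stored flat, dsub floats each) for the one nearest to `sub`.
fn find_nearest_centroid(sub: &[f32], centroids: &[f32], ksub: usize, dsub: usize) -> usize {
    let mut best = 0;
    let mut best_dist = f32::INFINITY;
    for k in 0..ksub {
        let c = &centroids[k * dsub..(k + 1) * dsub];
        let dist: f32 = sub.iter().zip(c).map(|(x, y)| (x - y) * (x - y)).sum();
        if dist < best_dist {
            best_dist = dist;
            best = k;
        }
    }
    best
}
```

For ksub = 256 this is 256 short distance computations per subspace, which is why encoding is the expensive direction and decoding is nearly free.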
PQDistanceTable: ADC
Asymmetric Distance Computation precomputes query-to-centroid distances:
pub struct PQDistanceTable<S: EmbeddingSpace, const M: usize, const NBITS: usize> {
table: Vec<S::DistanceValue>, // M * KSUB entries
ksub: usize,
}
Building the table:
pub fn build_distance_table(
&mut self,
query: &S::EmbeddingData,
) -> PQDistanceTable<S, M, NBITS> {
let ksub = Self::KSUB;
let query_slice = query.as_slice();
let mut table = Vec::with_capacity(M * ksub);
for subspace in 0..M {
self.fill_subvec_buffer(query_slice, subspace);
for k in 0..ksub {
let centroid_offset = (subspace * ksub + k) * self.dsub;
let centroid = &self.centroids[centroid_offset..centroid_offset + self.dsub];
// Delegate to the embedding space's metric so codecs
// inherit whatever distance the space defines.
let dist = S::slice_distance(&self.subvec_buffer, centroid);
table.push(S::DistanceValue::from(dist));
}
}
PQDistanceTable::new(table, ksub)
}
Once the table is built, scanning encoded vectors requires no per-dimension floating-point math: each candidate costs M table lookups and M additions. This is why PQ is fast, and why the const-generic design matters: the table dimensions are known at compile time. Computing the distance to an encoded vector looks like this:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize>
PQDistanceTable<S, M, NBITS>
{
pub fn distance(&self, code: &PQCode<M, NBITS>) -> S::DistanceValue {
(0..M)
.map(|m| {
let c = code.get(m) as usize;
self.table[m * self.ksub + c]
})
.fold(S::DistanceValue::zero(), |acc, d| acc + d)
}
}
Performance: Building the table is O(k × d) total (M subspaces × k centroids × d/M dimensions each). Each distance computation is O(M): typically 8 lookups and additions for M=8.
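To make the lookup structure concrete, here is a toy ADC pass with M = 2 and ksub = 2, using a prebuilt table of illustrative distances (plain slices, not the bb-codec types):

```rust
// Toy ADC: table[m * ksub + k] holds the precomputed query-to-centroid
// distance for centroid k of subspace m. A code's distance is the sum
// of M lookups.
fn adc_distance(table: &[f32], code: &[usize], ksub: usize) -> f32 {
    code.iter()
        .enumerate()
        .map(|(m, &c)| table[m * ksub + c])
        .sum()
}
```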
SDCTable: Symmetric Distance
When comparing two codes (not a raw query), we use precomputed centroid-to-centroid distances:
pub struct SDCTable<const M: usize, const NBITS: usize> {
table: Vec<f32>, // M * KSUB * KSUB entries
ksub: usize,
}
Built once during training. The squared-L2 constructor is a thin wrapper around a generic builder that accepts any subspace distance function, which keeps the loop reusable for alternative metrics:
impl<const M: usize, const NBITS: usize> SDCTable<M, NBITS> {
pub fn from_centroids(centroids: &[f32], dsub: usize) -> Self {
Self::from_centroids_with_distance(centroids, dsub, |a, b| {
a.iter()
.zip(b.iter())
.map(|(x, y)| {
let diff = x - y;
diff * diff
})
.sum()
})
}
pub fn from_centroids_with_distance(
centroids: &[f32],
dsub: usize,
subspace_distance: impl Fn(&[f32], &[f32]) -> f32,
) -> Self {
let ksub = Self::KSUB;
let mut table = vec![0.0f32; M * ksub * ksub];
for m in 0..M {
for i in 0..ksub {
let ci_offset = (m * ksub + i) * dsub;
let ci = &centroids[ci_offset..ci_offset + dsub];
for j in 0..ksub {
let cj_offset = (m * ksub + j) * dsub;
let cj = &centroids[cj_offset..cj_offset + dsub];
table[m * ksub * ksub + i * ksub + j] = subspace_distance(ci, cj);
}
}
}
Self { table, ksub }
}
}
Distance between two codes:
pub fn distance(&self, code1: &PQCode<M, NBITS>, code2: &PQCode<M, NBITS>) -> f32 {
let mut sum = 0.0f32;
for m in 0..M {
let i = code1.get(m) as usize;
let j = code2.get(m) as usize;
sum += self.table[m * self.ksub * self.ksub + i * self.ksub + j];
}
sum
}
Storage: M × k² × 4 bytes. For M=8, k=256: 2 MB.
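The storage arithmetic, spelled out:

```rust
// SDC table size: M subspaces × ksub² centroid pairs × 4 bytes per f32.
fn sdc_table_bytes(m: usize, ksub: usize) -> usize {
    m * ksub * ksub * std::mem::size_of::<f32>()
}
```

This grows quadratically in ksub, which is why SDC tables for NBITS much above 8 start to hurt.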
Codec Trait Integration
ProductQuantizer implements the Codec trait from bb-core:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize> Codec<S>
for ProductQuantizer<S, M, NBITS>
{
type Encoded = PQCode<M, NBITS>;
type EncodeRef<'a> = EagerOpRef<PQCode<M, NBITS>, PQError>;
type DecodeRef<'a> = EagerOpRef<S::EmbeddingData, PQError>;
type TrainRef<'a> = EagerOpRef<(), PQError>;
type ObserveRef<'a> = EagerOpRef<(), PQError>;
fn encode(&mut self, embedding: &S::EmbeddingData) -> Self::EncodeRef<'_> {
if !self.trained {
return EagerOpRef::err(OpId(0), PQError::NotTrained);
}
EagerOpRef::ok(OpId(0), self.encode_embedding(embedding))
}
fn decode(&self, encoded: &Self::Encoded) -> Self::DecodeRef<'_> {
if !self.trained {
return EagerOpRef::err(OpId(0), PQError::NotTrained);
}
EagerOpRef::ok(OpId(0), self.decode_code(encoded))
}
fn train(&mut self, embeddings: &[S::EmbeddingData]) -> Self::TrainRef<'_> {
self.train_on(embeddings);
EagerOpRef::ok(self.alloc_op_id(), ())
}
fn is_trained(&self) -> bool {
self.trained
}
// `encode_batch`, `decode_batch`, `observe`, `observe_batch`, `code_size` elided.
}
EagerOpRef is a synchronous OpRef implementation for local operations. The operation completes immediately, but the interface matches async network codecs.
EmbeddingSpace Implementation
ProductQuantizer also implements EmbeddingSpace directly for PQCode<M, NBITS>:
impl<S: EmbeddingSpace, const M: usize, const NBITS: usize> EmbeddingSpace
for ProductQuantizer<S, M, NBITS>
{
type EmbeddingData = PQCode<M, NBITS>;
type DistanceValue = S::DistanceValue;
fn distance(&self, lhs: &Self::EmbeddingData, rhs: &Self::EmbeddingData) -> Self::DistanceValue {
let sdc = self.sdc_table.as_ref()
.expect("must be trained");
S::DistanceValue::from(sdc.distance(lhs, rhs))
}
}
This enables using a trained ProductQuantizer as the embedding space for a FlatIndex:
// Index stores PQ codes directly
let mut pq = ProductQuantizer::<_, 8, 8>::new(space);
pq.train_on(&training_data);
let mut index = FlatIndex::new(&pq);
for vec in data {
let code = pq.encode_embedding(&vec);
index.add(&code, value, &());
}
// Search with codes
let query_code = pq.encode_embedding(&query);
let neighbors = index.search_knn(&query_code, 10);
Usage Example
use bb_codec::{ProductQuantizer, PQCode};
use bb_core::{F32L2Space, F32Embedding};
// Create quantizer for 128-dim vectors, 8 subspaces, 256 centroids
let space = F32L2Space::<128>;
let mut pq = ProductQuantizer::<_, 8, 8>::new(space);
// Train on representative data
pq.train_on(&training_vectors);
// Encode vectors
let code: PQCode<8, 8> = pq.encode_embedding(&vector);
// code is 8 bytes vs 512 bytes original
// ADC: search with raw query
let table = pq.build_distance_table(&query);
for code in encoded_database {
let dist = table.distance(&code); // 8 table lookups
}
// SDC: compare codes directly
let sdc = pq.sdc_table().unwrap();
let dist = sdc.distance(&code1, &code2);
// Decode to approximate vector
let reconstructed = pq.decode_code(&code);
PQ compression is one of those building blocks that quietly makes everything else possible. Without it, high-dimensional embeddings in peer-to-peer overlay networks would be too expensive to transmit.
Further Reading
- Product Quantization: Vector Compression for Fast Search: the theory behind PQ
- K-Means Clustering: how codebooks are trained
- The bytesandbrains Library: library overview