K-Means Clustering
K-means is the kind of algorithm that takes five minutes to explain and five weeks to get working reliably. It’s been around since the 1950s and we’re still finding new ways to break it. The problem itself is straightforward: given a set of data points, partition them into groups such that points within each group are more similar to each other than to points in other groups. K-means tackles this with a deceptively simple loop, and a surprising number of ways to go wrong.
The Problem
Given n data points in d-dimensional space, find k cluster centers (centroids) such that each point is assigned to its nearest centroid, minimizing the total within-cluster variance:

J = Σᵢ₌₁ᵏ Σ_{x ∈ Cᵢ} ||x − μᵢ||²

where Cᵢ is the set of points assigned to cluster i and μᵢ is the centroid of cluster i.
This optimization problem is NP-hard in general, but the k-means algorithm finds a local optimum through iterative refinement.
Lloyd’s Algorithm
The standard k-means algorithm, often called Lloyd’s algorithm, alternates between two steps:
- Assignment: Assign each point to the nearest centroid
- Update: Recompute each centroid as the mean of its assigned points
Initialize: Choose k initial centroids μ₁, μ₂, ..., μₖ
Repeat until convergence:
    # Assignment step
    For each point x:
        Assign x to cluster argmin_i ||x - μᵢ||²
    # Update step
    For each cluster i:
        μᵢ = mean of all points assigned to cluster i
Each iteration decreases (or maintains) the objective function, so the algorithm is guaranteed to converge. In practice, it converges to whatever local minimum it feels like, which may or may not be useful. The result depends heavily on the initial centroid placement.
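The two alternating steps can be sketched directly in Rust. This is a minimal illustration, not the bb-ml implementation: `lloyd_step` performs one assignment pass and one update pass over plain `Vec<f32>` points, and leaves empty clusters untouched (strategies for those are discussed under "Handling Edge Cases" below).

```rust
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// One Lloyd iteration: assign every point to its nearest centroid,
/// then move each centroid to the mean of its assigned points.
/// Returns the cluster index assigned to each point.
fn lloyd_step(data: &[Vec<f32>], centroids: &mut [Vec<f32>]) -> Vec<usize> {
    let k = centroids.len();
    let d = centroids[0].len();

    // Assignment step: argmin over squared distances to the k centroids.
    let assignments: Vec<usize> = data
        .iter()
        .map(|x| {
            (0..k)
                .min_by(|&i, &j| {
                    squared_distance(x, &centroids[i])
                        .partial_cmp(&squared_distance(x, &centroids[j]))
                        .unwrap()
                })
                .unwrap()
        })
        .collect();

    // Update step: accumulate per-cluster sums and counts, then average.
    let mut sums = vec![vec![0.0f32; d]; k];
    let mut counts = vec![0usize; k];
    for (x, &c) in data.iter().zip(&assignments) {
        for (s, v) in sums[c].iter_mut().zip(x) {
            *s += v;
        }
        counts[c] += 1;
    }
    for i in 0..k {
        // Empty clusters (counts[i] == 0) keep their old centroid here.
        if counts[i] > 0 {
            for s in sums[i].iter_mut() {
                *s /= counts[i] as f32;
            }
            centroids[i] = sums[i].clone();
        }
    }
    assignments
}
```

Calling `lloyd_step` in a loop until the assignments stop changing gives the full algorithm.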
Convergence
The algorithm converges when the centroids stop moving (or move less than some small threshold ε). In practice, convergence typically occurs within 10-100 iterations, though pathological cases exist where convergence takes exponentially long.
The time complexity per iteration is O(nkd), computing distances from n points to k centroids in d dimensions.
The Initialization Problem
Lloyd’s algorithm is sensitive to initialization. Poor initial centroids can lead to:
- Slow convergence: Many iterations needed to reach a good solution
- Suboptimal clusters: Convergence to a local minimum far from the global optimum
- Empty clusters: Centroids placed far from all data points
Consider clustering two well-separated groups. If both initial centroids happen to land in the same group, one will capture all the points while the other drifts aimlessly. This is one of those problems that sounds theoretical until it happens to you at 3am and half your clusters are empty.
Random Initialization
The simplest approach is to choose k random points from the dataset as initial centroids. This is fast but unreliable, and the quality of the final clustering varies significantly across runs.
K-Means++
K-means++ is one of those ideas that seems obvious in hindsight, which is the hallmark of genuinely good algorithm design. It spreads the initial centroids apart, dramatically improving the expected clustering quality. The key insight: choose each new centroid with probability proportional to its squared distance from existing centroids.
1. Choose the first centroid uniformly at random from the data points
2. For each subsequent centroid:
   a. Compute D(x)² for each point x, where D(x) is the distance to the nearest existing centroid
   b. Choose the next centroid with probability proportional to D(x)²
3. After k centroids are chosen, run Lloyd's algorithm
The probability weighting ensures that:
- Points far from existing centroids are more likely to be chosen
- The centroids naturally spread across the data
- Outliers have higher (but not guaranteed) selection probability
K-means++ provides a theoretical guarantee: the expected clustering cost is O(log k) times the optimal cost. In practice, it usually finds near-optimal solutions.
Implementation detail: Computing the probability distribution requires tracking D(x) for all points and sampling from a weighted distribution. The total initialization cost is O(nkd), the same as one Lloyd iteration.
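A sketch of the seeding procedure, assuming nothing beyond the standard library: a tiny xorshift generator stands in for a real RNG (in practice you would use the `rand` crate), and `D(x)²` is maintained incrementally as each centroid is added, which is what keeps the total cost at O(nkd).

```rust
/// Minimal deterministic PRNG standing in for a proper RNG.
struct XorShift(u64); // seed must be nonzero

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Top 53 bits scaled into [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// K-means++ seeding: first centroid uniform at random, each later
/// centroid sampled with probability proportional to D(x)².
fn kmeans_pp_init(data: &[Vec<f32>], k: usize, rng: &mut XorShift) -> Vec<Vec<f32>> {
    let n = data.len();
    let mut centroids = vec![data[(rng.next_f64() * n as f64) as usize].clone()];

    // dist[x] = squared distance from x to its nearest chosen centroid.
    let mut dist: Vec<f32> = data
        .iter()
        .map(|x| squared_distance(x, &centroids[0]))
        .collect();

    while centroids.len() < k {
        // Weighted sampling: walk the D(x)² weights until the target is hit.
        let total: f32 = dist.iter().sum();
        let mut target = rng.next_f64() as f32 * total;
        let mut chosen = n - 1;
        for (i, &d) in dist.iter().enumerate() {
            if target <= d {
                chosen = i;
                break;
            }
            target -= d;
        }
        centroids.push(data[chosen].clone());

        // Incremental update: only the new centroid can lower D(x).
        for (d, x) in dist.iter_mut().zip(data) {
            *d = d.min(squared_distance(x, centroids.last().unwrap()));
        }
    }
    centroids
}
```

Points already chosen have D(x)² = 0 and can never be selected again, so no duplicate centroids are produced.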
Handling Edge Cases
Empty Clusters
Empty clusters are the kind of edge case that textbooks mention in a footnote and production systems encounter daily. If a centroid has no assigned points, the update step has no data to average. Common strategies:
- Reinitialize: Replace the empty centroid with a random data point
- Split largest: Split the largest cluster by choosing a point far from its centroid
- Ignore: Allow the centroid to persist unchanged (may remain empty)
The first approach is simplest and usually sufficient.
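As an illustration of the reinitialization idea, the sketch below replaces each empty centroid with the data point that is currently worst served, i.e. farthest from every centroid. This deterministic variant is an assumption for the example; picking a uniformly random data point works just as well.

```rust
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Replace every centroid with zero assigned points (counts[i] == 0)
/// by the data point farthest from its nearest centroid.
fn fix_empty_clusters(data: &[Vec<f32>], centroids: &mut [Vec<f32>], counts: &[usize]) {
    for i in 0..centroids.len() {
        if counts[i] == 0 {
            let farthest = data
                .iter()
                .max_by(|a, b| {
                    // Distance from each candidate to its nearest centroid.
                    let da = centroids.iter()
                        .map(|c| squared_distance(a.as_slice(), c))
                        .fold(f32::INFINITY, f32::min);
                    let db = centroids.iter()
                        .map(|c| squared_distance(b.as_slice(), c))
                        .fold(f32::INFINITY, f32::min);
                    da.partial_cmp(&db).unwrap()
                })
                .unwrap();
            centroids[i] = farthest.clone();
        }
    }
}
```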
Convergence Criteria
Rather than waiting for exact convergence (which may never happen due to floating-point precision), practical implementations use:
- Centroid movement: Stop when all centroids move less than ε
- Assignment stability: Stop when no points change clusters
- Iteration limit: Stop after a maximum number of iterations
Combining these (stop when either centroids stabilize OR the iteration limit is reached) ensures termination while still finding good solutions.
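The centroid-movement test is the only one of the three that needs any code; a sketch, with the driver loop shown as comments since the iteration step itself could be any Lloyd implementation:

```rust
/// True when every centroid moved less than `eps` (Euclidean distance)
/// between two consecutive iterations.
fn converged(old: &[Vec<f32>], new: &[Vec<f32>], eps: f32) -> bool {
    old.iter().zip(new).all(|(a, b)| {
        let moved_sq: f32 = a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum();
        moved_sq <= eps * eps
    })
}

// Driver shape combining both criteria (one_iteration is whatever
// performs a single assignment + update pass):
//
// for _ in 0..max_iter {                     // iteration limit
//     let new = one_iteration(&data, &old);
//     if converged(&old, &new, 1e-4) {       // movement threshold
//         break;
//     }
//     old = new;
// }
```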
Complexity Analysis
| Phase | Time Complexity |
|---|---|
| K-means++ init | O(nkd) |
| Assignment step | O(nkd) |
| Update step | O(nd) |
| Total per iteration | O(nkd) |
For t iterations, the total complexity is O(nkdt + nkd) = O(nkdt).
Space complexity is O(nd + kd): storage for the data points plus the centroids.
Applications
K-means appears throughout machine learning and data processing:
- Vector quantization: Compress continuous values to discrete codes. This is where k-means really earns its keep, and it’s the basis for Product Quantization
- Image compression: Reduce color palette by clustering pixel values (your JPEG encoder has opinions about centroids whether you know it or not)
- Customer segmentation: Group users by behavior patterns
- Document clustering: Organize text by topic, provided you pick the right k (good luck)
- Initialization for other algorithms: Seed more sophisticated clustering methods. Sometimes the best use of k-means is getting out of the way
All of these share a common prerequisite: you need to pick k. If you figure out a reliable way to choose k automatically, let us know. We’ve been looking. The elbow method, silhouette scores, and gap statistics all help, but none of them are a substitute for actually understanding your data.
In bytesandbrains
The bb-ml crate provides a k-means implementation used internally by bb-codec for training Product Quantization codebooks:
use bb_ml::KMeans;
let data: Vec<Vec<f32>> = /* your data points */;
let kmeans = KMeans::fit(&data, k, max_iterations);
// Predict cluster for a new point
let cluster = kmeans.predict(&new_point);
// Access learned centroids
let centroid = kmeans.centroid(cluster);
The implementation uses k-means++ initialization and Lloyd’s algorithm with convergence detection.
Further Reading
- Product Quantization: Vector Compression for Fast Search: how k-means enables vector compression
- The bytesandbrains Library: library overview
- Arthur & Vassilvitskii, “k-means++: The Advantages of Careful Seeding” (2007): the original k-means++ paper