Stage 1 rotates the vector and quantizes each coordinate to a small number of bits using an optimal scalar quantizer matched to the resulting Beta distribution. This captures most of the signal but introduces some quantization error.
Stage 2 takes the leftover error from Stage 1, which is approximately isotropic thanks to the random rotation, and projects it onto many random directions, storing just the sign (+1 or -1) of each. At query time, these bits provide a correction that brings the estimate closer to the true answer. Because Stage 1 already captured most of the signal, the leftover is small, so even a compact set of sign bits is enough.
Below are the seven classical results that make this work.
Concentration of Measure
This is the foundational geometric fact. If \(u\) is uniform on the sphere \(\mathbb{S}^{d-1}\) and \(x\) is fixed, then the scalar projection \(\langle u, x \rangle\) has a sharply concentrated distribution. For large \(d\), it is well approximated by a Gaussian with variance \(\|x\|^2/d\):
Standard spherical concentration gives a subgaussian tail of the form \(\;\mathbb{P}\!\left(\left|\langle u,x\rangle\right| > \varepsilon\right) \leq 2\exp\!\left(-c\,\frac{d\,\varepsilon^2}{\|x\|^2}\right)\) for an absolute constant \(c>0\).
The tail probability decays exponentially in \(d\). This is not intuitive from everyday experience: it is purely a high-dimensional phenomenon.
Distribution of projections for a random unit vector. Orange dashed lines mark the theoretical spread. Drag d up and watch the distribution collapse inward. The concentration is already visible at d=64; by d=256 the spread is tiny.
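The collapse is easy to check numerically. The sketch below (a throwaway numpy illustration; `sphere_projection_spread` is a name invented here, not library code) measures the empirical spread of \(\langle u, x\rangle\) against the theoretical \(\|x\|/\sqrt{d}\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_projection_spread(d, n_samples=20000):
    """Empirical std of <u, x> for u uniform on S^{d-1} and x = e1."""
    g = rng.normal(size=(n_samples, d))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return u[:, 0].std()  # with x = e1, <u, x> is just the first coordinate

# Theory predicts spread ~ 1/sqrt(d): quadrupling d halves the spread.
for d in (16, 64, 256):
    print(d, sphere_projection_spread(d), 1 / np.sqrt(d))
```

Normalizing a Gaussian vector is the standard way to sample uniformly from the sphere, which is why the check is only a few lines.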
CLT & Beta Distribution of Rotated Coordinates
Each coordinate of \(Rx\) is a linear combination of all \(d\) original coordinates. By the Central Limit Theorem, sums of many weakly-correlated terms converge to Gaussian. More precisely, the normalised squared coordinates follow a Beta distribution that concentrates around equal energy per dimension:
Exact: \(\quad\frac{(Rx)_i^2}{\|x\|^2} \;\sim\; \mathrm{Beta}\!\left(\tfrac{1}{2},\,\tfrac{d-1}{2}\right)\)
Equivalent rescaling: \(\;\frac{d\,(Rx)_i^2}{\|x\|^2} \sim d\cdot \mathrm{Beta}\!\left(\tfrac{1}{2},\,\tfrac{d-1}{2}\right) \xrightarrow{\;d\to\infty\;} \chi^2_1\), the square of a standard Gaussian.
The coordinates \((Rx)_i\) are exchangeable and pairwise uncorrelated (\(\mathbb{E}[(Rx)_i(Rx)_j] = 0\) for \(i \neq j\)). Each coordinate is approximately Gaussian for large \(d\), and distinct coordinates behave nearly independently — not truly independent because the sphere constrains the sum of squares, but in high dimensions this coupling is negligible for scalar quantization purposes.
Left (pink): first coordinate of a spiky vector across 3000 draws, heavy-tailed and not bell-shaped. Right (green): same coordinate after random rotation. Green curve: fitted bell curve. Increase d and the fit tightens towards the theoretical Gaussian.
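This can be checked directly: rotate a maximally spiky vector with a random orthogonal matrix and compare the normalised squared first coordinate against the \(\mathrm{Beta}(\tfrac12, \tfrac{d-1}{2})\) mean of \(1/d\). A minimal sketch (`random_rotation` is an illustrative helper, not library code):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(d):
    """Haar-distributed random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution Haar

d = 64
x = np.zeros(d); x[0] = 1.0          # maximally spiky vector
samples = np.array([(random_rotation(d) @ x)[0] for _ in range(4000)])
frac = samples**2 / np.dot(x, x)     # should follow Beta(1/2, (d-1)/2)

print(frac.mean(), 1 / d)            # Beta(1/2, (d-1)/2) has mean exactly 1/d
```

Even though the input concentrates all its energy in one coordinate, the rotated coordinate's energy matches the equal-energy prediction.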
Rate-Distortion Theory
Every bit we spend per number buys us more accuracy. For Gaussian-distributed data under squared error, Shannon's rate-distortion theorem tells us exactly how much: \(D^*(R) = \sigma^2\, 2^{-2R}\), so each extra bit per sample quarters the error. No scheme can beat this floor for that setting.
How close does optimal scalar quantization get?
\(\;D_{\text{scalar}} \leq \dfrac{\pi e}{6} \cdot D^*(R) \approx D^*(R) + 1.53\;\text{dB}\)
Within about 1.53 dB of the theoretical best (a multiplicative factor of \(\pi e/6 \approx 1.42\)). This is a classical high-rate result for data with a known distribution, and should be read as an asymptotic benchmark rather than an exact guarantee at every bit rate.
For data with an unknown or spiky distribution (like raw activation vectors), scalar quantization does much worse. The rotation from section 02 reshapes the data into a known Beta distribution so an optimal quantizer can be designed and this near-optimal guarantee kicks in.
Green: Shannon's floor, the best any scheme can do. Blue: optimal scalar quantization after rotation, close to optimal. Pink: scalar quantization on the original spiky data, noticeably worse. The orange vertical marks one TurboQuant operating point (3.5 bits per dimension).
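The two curves and their constant gap are simple to compute, assuming unit-variance Gaussian data (the function names here are illustrative, not from any library):

```python
import numpy as np

def shannon_floor(R):
    """Gaussian rate-distortion floor D*(R) = 2^(-2R), with sigma^2 = 1."""
    return 2.0 ** (-2 * R)

def scalar_quantizer_bound(R):
    """High-rate distortion bound for optimal scalar quantization (Gersho)."""
    return (np.pi * np.e / 6) * shannon_floor(R)

for R in (1, 2, 3.5):
    gap_db = 10 * np.log10(scalar_quantizer_bound(R) / shannon_floor(R))
    print(R, shannon_floor(R), scalar_quantizer_bound(R), round(gap_db, 2))
# The gap is a constant ~1.53 dB, independent of the rate R.
```

Because both curves scale as \(2^{-2R}\), the gap in dB is flat: scalar quantization after rotation tracks the floor at every rate, just offset by \(10\log_{10}(\pi e/6)\).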
Johnson-Lindenstrauss Lemma
Pick two random arrows in a high dimensional space. In 2D, the angle between them could be anything. In high dimensions, it is almost always close to 90°. This is why random projections work: the projection directions do not pile up, so the projection sees the data evenly.
Histogram of angles between random pairs of arrows. Drag d up and watch the distribution collapse onto 90°. In 2D the angle could be anything; by d=100 nearly every pair is almost perpendicular.
Like asking \(k\) independent surveyors to estimate the same distance: with few the estimate is noisy, with many it locks in. The failure probability drops exponentially:
\[\mathbb{P}\!\left(\left|\|Jx\|^2 - \|x\|^2\right| > \varepsilon\|x\|^2\right) \;\leq\; 2\exp\!\left(-\frac{k\varepsilon^2}{4}\right)\]

Each sample is one random projection. The x axis shows the ratio of projected distance to true distance (1.0 means perfect). Orange lines mark the accuracy bounds. Increase k and watch the distribution collapse around 1. More projection dimensions means tighter estimates.
For \(n\) points there are \(\sim n^2/2\) pairs. Sum the failure chances from Point 2 over all pairs:
\[n^2 \cdot e^{-k\varepsilon^2/4} \;\leq\; \delta \quad\Longrightarrow\quad k \;\geq\; \frac{8\log n + 4\log(1/\delta)}{\varepsilon^2} = O\!\left(\frac{\log n}{\varepsilon^2}\right)\]

The original dimension \(d\) never appears. The \(\log n\) pays for protecting all pairs; the \(1/\varepsilon^2\) pays for tightness.
Green: chance that any pair fails. Orange: 1% safety threshold. Try adding more points or tightening the accuracy to see how the required number of projection dimensions changes. The required dimensions grow with the logarithm of the number of points, not proportionally.
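The norm-preservation claim is easy to test empirically with a Gaussian projection matrix whose rows are scaled by \(1/\sqrt{k}\) (a hypothetical sketch, not TurboQuant's actual \(J\)):

```python
import numpy as np

rng = np.random.default_rng(2)

def jl_ratio(d, k, n_trials=500):
    """Distribution of ||Jx||^2 / ||x||^2 over random Gaussian J's."""
    x = rng.normal(size=d)
    ratios = np.empty(n_trials)
    for t in range(n_trials):
        J = rng.normal(size=(k, d)) / np.sqrt(k)  # scaled so E[||Jx||^2] = ||x||^2
        ratios[t] = np.sum((J @ x) ** 2) / (x @ x)
    return ratios

ratios = jl_ratio(d=256, k=128)
print(ratios.mean(), ratios.std())  # mean near 1; spread shrinks like 1/sqrt(k)
```

Note that \(d\) barely matters here: the concentration is governed by \(k\) alone, exactly as the bound says.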
Arc-Cosine Identity for Gaussian Projections
For a Gaussian direction \(j \sim \mathcal{N}(0, I_d)\), the probability that \(r\) and \(q\) land on opposite sides of the random hyperplane is exactly \(\;\mathbb{P}\!\left[\mathrm{sign}(j^\top r) \neq \mathrm{sign}(j^\top q)\right] = \dfrac{\theta(r,q)}{\pi}\), where \(\theta(r,q) = \arccos\!\left(\frac{\langle r,q\rangle}{\|r\|\|q\|}\right)\). The key mixed expectation:
\(\mathbb{E}\!\left[\mathrm{sign}(j^\top r)\cdot(j^\top q)\right] = \sqrt{\dfrac{2}{\pi}}\,\dfrac{\langle r,\,q\rangle}{\|r\|}\)
Encoding: project the leftover error from Stage 1 onto many random directions and store just the sign (+1 or -1) of each. That is all we keep.
Decoding (query time): recreate the same directions from the shared seed and project the query onto each one. Multiply each stored sign by the query's projection in that direction, then average all the products. That average estimates similarity, off by a known scaling factor that we simply multiply away.
Green curve: exact formula. Blue dots: measured from random projections. The match is exact in expectation. Increase samples and the dots collapse onto the curve.
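Both the sign-disagreement probability and the mixed expectation can be verified by Monte Carlo (an illustrative check; all names below are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 32, 200000
r = rng.normal(size=d)
q = rng.normal(size=d)

theta = np.arccos(r @ q / (np.linalg.norm(r) * np.linalg.norm(q)))
J = rng.normal(size=(m, d))           # m random Gaussian directions

# Sign-disagreement probability should match theta / pi
disagree = np.mean(np.sign(J @ r) != np.sign(J @ q))
print(disagree, theta / np.pi)

# Mixed expectation should match sqrt(2/pi) * <r,q> / ||r||
mixed = np.mean(np.sign(J @ r) * (J @ q))
print(mixed, np.sqrt(2 / np.pi) * (r @ q) / np.linalg.norm(r))
```

The mixed expectation is the workhorse: it is what lets one bit per direction, multiplied by the query's projection, recover an inner product.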
Law of Large Numbers
Averaging \(m\) of these sign-times-projection products gives the estimator \(\hat{\ell}_m = \frac{1}{m}\sum_{i=1}^{m} \mathrm{sign}(j_i^\top r)\,(j_i^\top q)\). By linearity, \(\mathbb{E}[\hat{\ell}_m] = \sqrt{\tfrac{2}{\pi}}\,\dfrac{\langle r,q\rangle}{\|r\|}\) (unbiased up to a known constant), and
\(\mathrm{Var}[\hat{\ell}_m] \;\propto\; \dfrac{1}{m}\)
Blue: measured noise. Orange: theory (variance \(\propto 1/m\): double the bits, halve the variance). The slopes match.
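The \(1/m\) scaling shows up immediately in simulation. A hedged sketch (names are illustrative): quadrupling the number of sign bits should halve the standard deviation of the estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32
r = rng.normal(size=d)
q = rng.normal(size=d)

def estimator_std(m, n_reps=1000):
    """Empirical std of the m-direction sign estimator over many repetitions."""
    J = rng.normal(size=(n_reps, m, d))
    ests = np.mean(np.sign(J @ r) * (J @ q), axis=1)
    return ests.std()

s1, s2 = estimator_std(64), estimator_std(256)
print(s1 / s2)   # 4x the directions -> ~2x less noise
```

This is just the law of large numbers at work: variance of an average of i.i.d. terms is the per-term variance divided by \(m\).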
Shared Randomness
\(R,\, J\): stored once as a random seed (small, fixed size).
■ Stored per vector. ■ Regenerated from seed (free). ■ Extra metadata competitors need. TurboQuant eliminates the orange column.
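A minimal sketch of the shared-seed trick, assuming numpy's seeded `default_rng` stands in for whatever PRNG the real system uses: encoder and decoder rebuild bit-identical \(R\) and \(J\) from one integer, so neither matrix is ever stored or transmitted.

```python
import numpy as np

def make_operators(seed, d, k):
    """Regenerate the rotation R and projection J from a shared seed."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    R = q * np.sign(np.diag(r))          # random orthogonal rotation
    J = rng.normal(size=(k, d))          # random projection directions
    return R, J

R_enc, J_enc = make_operators(seed=42, d=64, k=128)   # encoder side
R_dec, J_dec = make_operators(seed=42, d=64, k=128)   # decoder side
print(np.allclose(R_enc, R_dec) and np.allclose(J_enc, J_dec))  # True
```

The per-vector storage is only the quantized codes and sign bits; the seed is a few bytes shared once.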
How They Chain Together
None of these are new mathematics. The elegance of TurboQuant is composing seven classical results in exactly the right order.
Putting It All Together
Now that we have seen the seven ingredients, here is the full recipe.
- Generate shared random operators \(R\) and \(J\) from a seed.
- Rotate the KV vector: \(y = Rx\).
- Quantize \(y\) coordinatewise with a fixed scalar grid to obtain the Stage 1 reconstruction \(\tilde y\).
- Form the residual \(r = x - R^\top \tilde y\), then store \(\mathrm{sign}(Jr)\) as Stage 2.
- At query time, combine the Stage 1 reconstruction with a calibrated estimator from \(\mathrm{sign}(Jr)\) and \(Jq\) to approximate \(\langle x,q\rangle\).
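Here is one way the recipe might look end to end. This is a simplified sketch, not TurboQuant's actual implementation: it uses a uniform scalar grid in place of the Beta-matched optimal quantizer, a dense Gaussian \(J\), and stores the residual norm \(\|r\|\) as one extra scalar per vector; all function names are invented.

```python
import numpy as np

def make_operators(seed, d, k):
    rng = np.random.default_rng(seed)
    q_mat, r_mat = np.linalg.qr(rng.normal(size=(d, d)))
    R = q_mat * np.sign(np.diag(r_mat))   # random rotation R
    J = rng.normal(size=(k, d))           # random sign-projection directions J
    return R, J

def encode(x, R, J, n_bits=4):
    """Two-stage encoding: scalar-quantized rotation + residual sign bits."""
    y = R @ x
    levels = 2 ** n_bits
    lo, hi = y.min(), y.max()
    # Stage 1: uniform scalar grid (a stand-in for the Beta-matched quantizer)
    codes = np.clip(np.round((y - lo) / (hi - lo) * (levels - 1)), 0, levels - 1)
    y_hat = lo + codes / (levels - 1) * (hi - lo)
    # Stage 2: sign bits of the projected residual, plus its norm
    r = x - R.T @ y_hat
    return codes, (lo, hi), np.sign(J @ r), np.linalg.norm(r)

def inner_product_estimate(q, codes, grid, signs, r_norm, R, J, n_bits=4):
    lo, hi = grid
    y_hat = lo + codes / (2 ** n_bits - 1) * (hi - lo)
    stage1 = (R.T @ y_hat) @ q
    # Calibration uses E[sign(j.r)(j.q)] = sqrt(2/pi) <r,q> / ||r||
    stage2 = np.sqrt(np.pi / 2) * r_norm * np.mean(signs * (J @ q))
    return stage1 + stage2

rng = np.random.default_rng(7)
d, k = 64, 512
R, J = make_operators(seed=0, d=d, k=k)
x, q = rng.normal(size=d), rng.normal(size=d)
est = inner_product_estimate(q, *encode(x, R, J), R=R, J=J)
print(est, x @ q)   # the estimate lands close to the true inner product
```

Multiplying the sign average by \(\sqrt{\pi/2}\,\|r\|\) inverts the known constant from the arc-cosine identity, so Stage 2 contributes an unbiased estimate of \(\langle r, q\rangle\) on top of the Stage 1 term.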
References & further reading
- Spherical concentration. Vershynin, High-Dimensional Probability (2018).
- Coordinate laws on the sphere. Diaconis & Freedman, "Asymptotics of graphical projection pursuit" (1984).
- Gaussian rate-distortion. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE National Convention Record (1959). High-rate gap: Gersho, "Asymptotically optimal block quantization" (1979).
- Johnson-Lindenstrauss embeddings. Johnson & Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space" (1984). Dasgupta & Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss" (2003).
- Arc-cosine / random hyperplane identity. Goemans & Williamson, "Improved approximation algorithms for maximum cut" (1995).
- QJL estimator. Zandieh, Daliri, Hadian & Mirrokni, "TurboQuant" (ICLR 2026), arXiv:2504.19874.
- PolarQuant. Han, Kacham, Karbasi, Mirrokni & Zandieh, "PolarQuant: Quantizing the KV Cache Using the Polar Coordinate System" (2025), arXiv:2502.02617.