ICLR 2026

The Maths Behind TurboQuant

LLM inference is bottlenecked by the KV cache. TurboQuant compresses it to around 3 to 3.5 bits per dimension with near-optimal worst-case distortion up to a small constant factor, using only data-oblivious random transforms shared across vectors.

Seven mathematical pillars

01 Concentration · 02 Rotated coordinates · 03 Rate-distortion · 04 JL lemma · 05 Arc-cosine · 06 Law of Large Numbers · 07 Shared randomness
How TurboQuant works

TurboQuant compresses each KV vector in two stages using only a shared random seed.

Stage 1 rotates the vector and quantizes each coordinate to a small number of bits using an optimal scalar quantizer matched to the resulting Beta distribution. This captures most of the signal but introduces some quantization error.

Stage 2 takes the leftover error from Stage 1, which is approximately isotropic thanks to the random rotation, and projects it onto many random directions, storing just the sign (+1 or -1) of each. At query time, these bits provide a correction that brings the estimate closer to the true answer. Because Stage 1 already captured most of the signal, the leftover is small, so even a compact set of sign bits is enough.

Below are the seven classical results that make this work.
01

Concentration of Measure

Intuition In low dimensions, a random direction on the sphere could point anywhere. In high dimensions, almost all of the sphere's surface area piles up near the equator relative to any fixed axis. The deviations decay exponentially in \(d\). Importantly, high dimensionality makes random quantities predictable.
2D (circle): spread out · 10D (sphere): mostly near equator · 100D: nearly all near equator.

This is the foundational geometric fact. If \(u\) is uniform on the sphere \(\mathbb{S}^{d-1}\) and \(x\) is fixed, then the scalar projection \(\langle u, x \rangle\) has a sharply concentrated distribution. For large \(d\), it is well approximated by a Gaussian with variance \(\|x\|^2/d\):

Concentration on the sphere \(\langle u, x \rangle \;\approx\; \mathcal{N}\!\left(0,\,\frac{\|x\|^2}{d}\right)\)

Standard spherical concentration gives a subgaussian tail of the form \(\;\mathbb{P}\!\left(\left|\langle u,x\rangle\right| > \varepsilon\right) \leq 2\exp\!\left(-c\,\frac{d\,\varepsilon^2}{\|x\|^2}\right)\) for an absolute constant \(c>0\).

The tail probability decays exponentially in \(d\). This is not intuitive from everyday experience: it is purely a high-dimensional phenomenon.
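The concentration can be checked numerically. A minimal sketch (ours, not from the paper): sample uniform directions as normalised Gaussian vectors and compare the measured spread of \(\langle u, x \rangle\) to the predicted \(\|x\|/\sqrt{d}\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_projections(d, n_samples=20000):
    """Projections <u, x> of a fixed unit vector x onto uniform sphere directions."""
    # A Gaussian vector normalised to unit length is uniform on the sphere.
    u = rng.normal(size=(n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    x = np.zeros(d)
    x[0] = 1.0                      # fixed unit vector, so ||x|| = 1
    return u @ x

for d in (4, 64, 256):
    p = sphere_projections(d)
    # Theory: spread ~ ||x|| / sqrt(d); the measured std tracks it closely.
    print(f"d={d}: measured std {p.std():.4f}, predicted {1 / np.sqrt(d):.4f}")
```

At \(d=256\) the measured standard deviation is already within a fraction of a percent of \(1/16\), which is exactly the collapse that visualization V1 below shows interactively.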

V1 How fast projections concentrate as dimension grows

Distribution of projections for a random unit vector. Orange dashed lines mark the theoretical spread. Drag d up and watch the distribution collapse inward. The concentration is already visible at d=64; by d=256 the spread is tiny.

Connection to TurboQuant & attention savings

Why this matters for quantisation: if every coordinate of \(Rx\) lives in a predictable band, we can design a single quantisation grid that works for all coordinates. No per-channel scale factors, no calibration pass. In LLaMA-2 70B, with head dimension \(d{=}128\), that is 128 scale factors we do not store per token.
02

CLT & Beta Distribution of Rotated Coordinates

Intuition Imagine a choir where one singer is far louder than the rest. If we record with one microphone per singer, we'd need a sensitive mic for the loud one and basic mics for everyone else. There would thus be different equipment for each channel. Now imagine mixing the audio so every channel carries an equal blend of all voices. Suddenly one mic setting works for every channel. That's what a random rotation does to a vector: it spreads the energy evenly across coordinates, so a single quantisation grid fits every dimension.

Each coordinate of \(Rx\) is a linear combination of all \(d\) original coordinates. By the Central Limit Theorem, sums of many weakly-correlated terms converge to Gaussian. More precisely, the normalised squared coordinates follow a Beta distribution that concentrates around equal energy per dimension:

Distribution of rotated coordinates \((Rx)_i \;\overset{d}{\approx}\; \mathcal{N}\!\left(0,\;\frac{\|x\|^2}{d}\right) \quad \text{for large } d\)

Exact: \(\quad\frac{(Rx)_i^2}{\|x\|^2} \;\sim\; \mathrm{Beta}\!\left(\tfrac{1}{2},\,\tfrac{d-1}{2}\right)\)
Equivalent rescaling: \(\;\frac{d\,(Rx)_i^2}{\|x\|^2} \sim d\cdot \mathrm{Beta}\!\left(\tfrac{1}{2},\,\tfrac{d-1}{2}\right) \;\to\; \chi^2_1\) as \(d \to \infty\), matching the Gaussian picture above.

The coordinates \((Rx)_i\) are exchangeable and pairwise uncorrelated (\(\mathbb{E}[(Rx)_i(Rx)_j] = 0\) for \(i \neq j\)). Each coordinate is approximately Gaussian for large \(d\), and distinct coordinates behave nearly independently — not truly independent because the sphere constrains the sum of squares, but in high dimensions this coupling is negligible for scalar quantisation purposes.
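A quick numerical check (a sketch under our own conventions, not the paper's code): draw Haar-random rotations via QR of a Gaussian matrix and verify that the normalised squared first coordinate has the Beta mean \(1/d\), even for a maximally spiky input:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(d):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))    # sign correction for exact Haar measure

d = 32
x = np.zeros(d)
x[0] = 10.0                            # all energy in one coordinate
samples = np.array([(random_rotation(d) @ x)[0] ** 2 for _ in range(5000)])
frac = samples / np.dot(x, x)          # (Rx)_1^2 / ||x||^2
# Beta(1/2, (d-1)/2) has mean 1/d = 0.03125 for d = 32.
print(frac.mean())
```

The spiky input is irrelevant: after rotation, each coordinate carries on average exactly its fair share \(1/d\) of the energy.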
V2 Original vs rotated coordinate distributions

Left (pink): first coordinate of a spiky vector across 3000 draws, heavy tailed and not bell shaped. Right (green): same coordinate after random rotation. Green curve: fitted bell curve. Increase d and the fit tightens towards the theoretical Gaussian.

Connection to TurboQuant & attention savings

The quantisation payoff: once every coordinate follows the same known distribution (Beta), we can precompute an optimal scalar quantizer for that distribution. No codebook and no per-block scale factors are needed. This is why Stage 1 can use a single fixed grid and still get close to the theoretical best.
03

Rate-Distortion Theory

Intuition When we compress a song to MP3, we accept some loss in quality to save space. There is a theoretical limit to how good it can sound at a given file size, and no encoder can beat it. Shannon proved the same kind of limit exists for compressing numbers: at a given bit budget, there is a minimum error we are stuck with. An optimal scalar quantizer nearly hits that limit, but only if the distribution is known. The rotation from section 02 gives us a known Beta distribution, which makes this possible.

Every bit we spend per number buys us more accuracy. For Gaussian distributed data under squared error, Shannon's theorem tells us exactly how much: each extra bit quarters the error. No scheme can beat this floor for that setting.

Shannon's compression floor \(D^*(R) = \sigma^2 \cdot 2^{-2R}\)

How close does optimal scalar quantization get?
\(\;D_{\text{scalar}} \leq \dfrac{\pi e}{6} \cdot D^*(R) \approx D^*(R) + 1.53\;\text{dB}\)

Within about 1.53 dB (a multiplicative factor of \(\pi e/6 \approx 1.42\)) of the theoretical best. This is a classical high-rate result for data with a known distribution, and should be read as an asymptotic benchmark rather than an exact guarantee at every bit rate.

For data with an unknown or spiky distribution (like raw activation vectors), scalar quantization does much worse. The rotation from section 02 reshapes the data into a known Beta distribution so an optimal quantizer can be designed and this near-optimal guarantee kicks in.
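The gap to the floor can be measured directly. A sketch (our construction, not the paper's quantizer): fit a fixed-rate Lloyd-Max scalar quantizer to Gaussian samples and compare its distortion to Shannon's floor \(2^{-2R}\) with \(\sigma^2 = 1\):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=100_000)        # Gaussian source, sigma^2 = 1

def lloyd_max(x, bits, iters=40):
    """Fixed-rate optimal scalar quantizer via Lloyd iterations (1-D k-means)."""
    levels = np.quantile(x, (np.arange(2 ** bits) + 0.5) / 2 ** bits)  # init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for j in range(len(levels)):
            if np.any(idx == j):
                levels[j] = x[idx == j].mean()   # move level to cell centroid
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return np.mean((x - levels[idx]) ** 2)       # mean squared distortion

results = {b: lloyd_max(data, b) for b in (2, 3, 4)}
for b, D in results.items():
    # Measured distortion sits above the floor 2^{-2b} by a small constant factor.
    print(f"{b} bits: D={D:.5f}, Shannon floor {2.0 ** (-2 * b):.5f}")
```

The measured ratio approaches the classical fixed-rate constant as the bit budget grows, consistent with the "small constant factor above the floor" picture in the formula box above.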

V3 How much error does each bit budget buy?

Green: Shannon's floor, the best any scheme can do. Blue: optimal scalar quantization after rotation, close to optimal. Pink: scalar quantization on the original spiky data, noticeably worse. The orange vertical marks one TurboQuant operating point (3.5 bits per dimension).

Connection to TurboQuant & attention savings

The efficiency claim: Shannon's formula above gives the theoretical floor for any quantizer. TurboQuant's paper proves its own MSE bound: \(D_{\text{mse}} \leq \sqrt{3\pi/2}\cdot 4^{-b}\), against a lower bound of \(4^{-b}\) for any \(b\)-bit quantizer, a gap of about \(2.17\times\). At 3 to 3.5 bits per dimension this is near-optimal up to a small constant factor. For context, FP16 uses 16 bits per dimension.
04

Johnson-Lindenstrauss Lemma

Intuition We can take a million points in an enormous space and project them down to a few hundred dimensions, and all the distances between them barely change. The number of dimensions we need depends only on how many points there are, not on how large the original space was. A random matrix does the job.
Johnson-Lindenstrauss lemma For any \(n\) points in \(\mathbb{R}^d\) and \(\varepsilon \in (0,1)\), there exists a linear map \(f: \mathbb{R}^d \to \mathbb{R}^k\) with \(k = O\!\left(\frac{\log n}{\varepsilon^2}\right)\) such that for all pairs \(i,j\): \[(1-\varepsilon)\|x_i - x_j\|^2 \;\leq\; \|f(x_i)-f(x_j)\|^2 \;\leq\; (1+\varepsilon)\|x_i-x_j\|^2\] And a random Gaussian matrix \(J\) with entries \(\sim\mathcal{N}(0,1/k)\) is such a map with high probability. Note: \(d\) does not appear in the bound on \(k\).
Point 1: High dimensions are mostly "equator"

Pick two random arrows in a high dimensional space. In 2D, the angle between them could be anything. In high dimensions, it is almost always close to 90°. This is why random projections work: the projection directions do not pile up, so the projection sees the data evenly.

JL-A Angles between random vectors converge to 90° in high dimensions

Histogram of angles between random pairs of arrows. Drag d up and watch the distribution collapse onto 90°. In 2D the angle could be anything; by d=100 nearly every pair is almost perpendicular.

Point 2: More projections, better accuracy

Like asking \(k\) independent surveyors to estimate the same distance: with few the estimate is noisy, with many it locks in. The failure probability drops exponentially:

\[\mathbb{P}\!\left(\left|\|Jx\|^2 - \|x\|^2\right| > \varepsilon\|x\|^2\right) \;\leq\; 2\exp\!\left(-\frac{k\varepsilon^2}{4}\right)\]
JL-B How close is the projected distance to the true distance? (one pair)

Each sample is one random projection. The x axis shows the ratio of projected distance to true distance (1.0 means perfect). Orange lines mark the accuracy bounds. Increase k and watch the distribution collapse around 1. More projection dimensions means tighter estimates.

Point 3: Guaranteeing all pairs at once

For \(n\) points there are \(\sim n^2/2\) pairs. Sum the failure chances from Point 2 over all pairs:

\[n^2 \cdot e^{-k\varepsilon^2/4} \;\leq\; \delta \quad\Longrightarrow\quad k \;\geq\; \frac{8\log n + 4\log(1/\delta)}{\varepsilon^2} = O\!\left(\frac{\log n}{\varepsilon^2}\right)\]

The original dimension \(d\) never appears. The \(\log n\) pays for protecting all pairs; the \(1/\varepsilon^2\) pays for tightness.
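The lemma is easy to see in action (parameter choices ours): project 50 points from \(d = 10{,}000\) down to \(k = 512\) dimensions with a Gaussian map and check every pairwise distance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
d, k, n = 10_000, 512, 50
X = rng.normal(size=(n, d))                        # n points in R^d
J = rng.normal(scale=1 / np.sqrt(k), size=(k, d))  # JL map, entries N(0, 1/k)
Y = X @ J.T                                        # projected points in R^k

# Squared-distance ratio for every pair; JL says all of them stay near 1.
ratios = [np.linalg.norm(Y[i] - Y[j]) ** 2 / np.linalg.norm(X[i] - X[j]) ** 2
          for i, j in combinations(range(n), 2)]
print(f"worst pairs: {min(ratios):.3f} .. {max(ratios):.3f}")
```

All \(n(n-1)/2 = 1225\) pairs survive a 20x dimension reduction with distortion of a few tens of percent, and nothing in the required \(k\) depended on \(d = 10{,}000\).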

JL-C How many projection dimensions do we need?

Green: chance that any pair fails. Orange: 1% safety threshold. Try adding more points or tightening the accuracy to see how the required number of projection dimensions changes. The required dimensions grow with the logarithm of the number of points, not proportionally.

The punchline The required projection dimension grows only logarithmically in the number of points, regardless of the original dimension. A million point cloud could live in ten dimensions or ten million; the projection size barely changes. Only the number of points and the desired accuracy matter.
Connection to TurboQuant

JL is the "random projections are trustworthy" guarantee. It tells us geometry survives the projection; the arc-cosine identity in section 05 explains why one bit per dimension is enough.
05

Arc-Cosine Identity for Gaussian Projections

Intuition Draw a random line through the origin. Two vectors either land on the same side or opposite sides. The chance they agree depends only on the angle between them. Similar vectors almost always agree; perpendicular ones agree half the time. A single yes/no bit carries real angular information.
Arc-cosine identity (exact for Gaussian random hyperplanes, any d) \(\mathbb{P}\!\left(\mathrm{sign}(j^\top r) = \mathrm{sign}(j^\top q)\right) = 1 - \frac{\theta(r,q)}{\pi}\)

where \(\theta(r,q) = \arccos\!\left(\frac{\langle r,q\rangle}{\|r\|\|q\|}\right)\). The key mixed expectation:

\(\mathbb{E}\!\left[\mathrm{sign}(j^\top r)\cdot(j^\top q)\right] = \sqrt{\dfrac{2}{\pi}}\,\dfrac{\langle r,\,q\rangle}{\|r\|}\)

Encoding: project the leftover error from Stage 1 onto many random directions and store just the sign (+1 or -1) of each. That is all we keep.

Decoding (query time): recreate the same directions from the shared seed and project the query onto each one. Multiply each stored sign by the query's projection in that direction, then average all the products. That average estimates similarity, off by a known scaling factor that we simply multiply away.
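The encode/decode rule fits in a few lines (a sketch with our own variable names, using the mixed expectation above to undo the scaling):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 64, 20_000                    # m directions; large here to show convergence
r = rng.normal(size=d)               # stand-in for a Stage-1 residual
q = rng.normal(size=d)               # query vector
J = rng.normal(size=(m, d))          # one Gaussian direction per row

bits = np.sign(J @ r)                # encoding: one stored sign bit per direction
proj = J @ q                         # decoding: query projections, recreated from seed
# E[sign(j.r)(j.q)] = sqrt(2/pi) <r,q>/||r||, so multiply the average back up.
est = np.sqrt(np.pi / 2) * np.linalg.norm(r) * np.mean(bits * proj)
print(est, np.dot(r, q))             # estimate vs true inner product
```

Note that the decoder never sees \(r\) itself, only its sign bits and norm; the query-side projections do the rest.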

V5 How often do two vectors land on the same side of a random line?

Green curve: exact formula. Blue dots: measured from random projections. The match is exact in expectation. Increase samples and the dots collapse onto the curve.

Connection to TurboQuant

Suppose the true similarity between a residual and a query is 0.7. A single sign bit just says +1 or -1, which is crude. But TurboQuant stores one bit per dimension, so for a 128 dimensional vector that is 128 sign bits. We multiply each stored sign by the query's projection in that direction, then average all 128 products. After correcting for the known scaling factor, we might get 0.69. That is Stage 2: a cheap correction built from signs that closes the gap left by Stage 1's quantization.
06

Law of Large Numbers

Intuition Ask one person to guess the height of a building and the answer could be way off. Ask ten and average their guesses, and we get closer. Ask a hundred and the average is remarkably accurate. Each random direction works the same way: one direction gives a noisy reading, but averaging many locks the estimate in near the truth.
Estimator accuracy: more bits, less noise \(\hat{\ell}_m = \frac{1}{m}\sum_{i=1}^m \mathrm{sign}(j_i^\top r)\cdot(j_i^\top q)\)

\(\mathbb{E}[\hat{\ell}_m] = \sqrt{\tfrac{2}{\pi}}\,\dfrac{\langle r,q\rangle}{\|r\|} \qquad\text{(unbiased up to known constant)}\)

\(\mathrm{Var}[\hat{\ell}_m] \;\propto\; \dfrac{1}{m}\)
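A sketch of the \(1/m\) law (simulation setup ours): measure the estimator's standard deviation at \(m=64\) and \(m=256\) directions and check that quadrupling the bits halves the noise:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 32
r = rng.normal(size=d)
q = rng.normal(size=d)

def estimator_std(m, trials=1000):
    """Empirical std of the m-direction sign estimator over many direction sets."""
    J = rng.normal(size=(trials, m, d))
    ell = np.mean(np.sign(J @ r) * (J @ q), axis=1)   # one estimate per trial
    return ell.std()

s64, s256 = estimator_std(64), estimator_std(256)
# Var ~ 1/m, so std ~ 1/sqrt(m): 4x the directions, half the noise.
print(s64 / s256)    # expect about sqrt(256/64) = 2
```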
V6 How fast does noise shrink as we add more bits?

Blue: measured noise. Orange: theory (double the bits, halve the noise). The slopes match.

Connection to TurboQuant

Stage 1 captures most of the signal but leaves a small residual. Stage 2 stores one sign bit per dimension to compress that residual. Each bit is a noisy estimate of the same similarity. By the law of large numbers, averaging all d of them shrinks the variance by a factor of d (so the standard deviation shrinks by a factor of √d). This is why even one bit per dimension is enough to make Stage 2 error much smaller than Stage 1 quantization error.
07

Shared Randomness

Intuition Agree on a recipe (a random seed) in advance. Both sides regenerate the same rotation and projection from it. The seed is a small fixed size (a typical implementation uses 32 bytes); everything else is reconstructed on the fly.
What must be stored per KV vector \(\underbrace{\mathrm{quant}(Rx)}_{b\;\text{bits/dim}} \;\|\; \underbrace{\mathrm{sign}(Jr)}_{1\;\text{bit/dim}} \;\|\; \underbrace{\|x\|}_{\text{1 scalar}}\)

\(R,\, J\): stored once as a random seed (small, fixed size).
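A minimal sketch of the seed-sharing protocol (interface names ours): encoder and decoder call the same deterministic generator and obtain bit-identical transforms, so neither \(R\) nor \(J\) is ever stored or transmitted:

```python
import numpy as np

def transforms_from_seed(seed, d, m):
    """Regenerate the rotation R and projection J deterministically from a seed."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    R = q * np.sign(np.diag(r))        # Haar-random rotation
    J = rng.normal(size=(m, d))        # Gaussian sign-projection directions
    return R, J

R_enc, J_enc = transforms_from_seed(seed=42, d=64, m=64)   # encoder side
R_dec, J_dec = transforms_from_seed(seed=42, d=64, m=64)   # decoder side
# Only the small seed crosses the wire; the regenerated matrices match exactly.
print(np.array_equal(R_enc, R_dec) and np.array_equal(J_enc, J_dec))   # True
```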
V7 What gets stored per vector?

Legend: stored per vector · regenerated from seed (free) · extra metadata competitors need. TurboQuant eliminates the orange column.

Connection to TurboQuant

Traditional methods typically need per-vector metadata (scale factors, codebook indices). Because TurboQuant's random transforms are data-oblivious and shared across vectors, it avoids this kind of per-vector calibration overhead.

How They Chain Together

None of these are new mathematics. The elegance of TurboQuant is composing seven classical results in exactly the right order.

Chain Logical dependency graph

Putting It All Together

Now that we have seen the seven ingredients, here is the full recipe.

TurboQuant in five steps
  1. Generate shared random operators \(R\) and \(J\) from a seed.
  2. Rotate the KV vector: \(y = Rx\).
  3. Quantise \(y\) coordinatewise with a fixed scalar grid to obtain \(\tilde y\) (Stage 1).
  4. Form the residual \(r = x - R^\top \tilde y\), then store \(\mathrm{sign}(Jr)\) as Stage 2.
  5. At query time, combine the Stage 1 reconstruction with a calibrated estimator from \(\mathrm{sign}(Jr)\) and \(Jq\) to approximate \(\langle x,q\rangle\).
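The five steps can be sketched end to end. This is our simplification, not the paper's exact design: a uniform 4-bit grid with a per-vector step stands in for the Beta-matched quantizer, \(m=d\) sign bits are used, and \(\|r\|\) is stored rather than \(\|x\|\):

```python
import numpy as np

def make_ops(seed, d, m):
    """Step 1: shared random rotation R and projection J from one seed."""
    rng = np.random.default_rng(seed)
    q, rm = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(rm)), rng.normal(size=(m, d))

def encode(x, R, J, b=4):
    y = R @ x                                            # step 2: rotate
    step = 2 * np.abs(y).max() / (2 ** b - 1)            # uniform grid (stand-in)
    codes = np.clip(np.round(y / step), -2 ** (b - 1), 2 ** (b - 1) - 1)  # step 3
    r = x - R.T @ (codes * step)                         # step 4: residual
    return codes, step, np.sign(J @ r), np.linalg.norm(r)

def inner_product(qv, codes, step, sbits, rnorm, R, J):
    stage1 = (codes * step) @ (R @ qv)                   # <R^T y_hat, qv>
    stage2 = np.sqrt(np.pi / 2) * rnorm * np.mean(sbits * (J @ qv))
    return stage1 + stage2                               # step 5: combine

d = 128
R, J = make_ops(seed=42, d=d, m=d)
rng = np.random.default_rng(7)
x, qv = rng.normal(size=d), rng.normal(size=d)
est = inner_product(qv, *encode(x, R, J), R=R, J=J)
print(est, np.dot(x, qv))          # estimate vs exact inner product
```

This toy setting spends 4 + 1 bits per dimension plus two scalars; the paper's Beta-matched grids are what bring the budget down to the 3 to 3.5 bit operating points quoted above.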

References & further reading