A first principles proof of concentration of lipschitz functions on the sphere (and maybe Gaussian) based on rotational symmetry

#writing/math #math

One of the most useful concentration inequalities I learned back in my probability class is the phenomenon of concentration of Lipschitz functions. For example, it can be used to show that, in understanding the maximum of a Gaussian process, the hard part is really just to characterize the expected supremum, and a high probability bound just falls out. It's also used in a proof of the johnson-lindenstrauss-lemma.

The proof we did in class relied as a black box on isoperimetric inequalities for the sphere and the gaussian. (From now on, I'll stick to talking about the sphere.) Informally, this says that, fixing a (hyper)area, the spherical caps are the subsets of the sphere with the smallest blowups ($\epsilon$ sized neighborhoods). If you fix your area to be half the sphere, this says that a hemisphere is the set which minimizes the size of a blowup. But it turns out that the size of this blowup is still really big. In particular, an $\epsilon$ blowup will have take up something like $1-2e^{-c\epsilon^2}$ of the sphere. The crucial observation is that one can associate to a Lipschitz function $f$ it's lower level set (let's call this set $L(f)$): the points for which s.t. $f(x)\leq M$ (where $M$ here denotes its median). The size of the $\epsilon$ blowup of $L(f)$ is then precisely the probability that $f(x) - M \leq \epsilon$. Since $M$ was taken to be the median, $L(f)$ has area at least half the sphere, allowing us to apply the isoperimetric inequality and blowup bound from before.

On the one hand, I really liked this approach, as it gave a very clean conceptual picture. Fitting well into the broader picture of mixing, isoperimetry, and concentration.

However, it still feels a little unsatisfying, as it felt like we didn't actually really prove anything... just mapped from one problem to pretty much an equivalent problem.

Recently my mind somehow circled back to this (I've been learning about random unitaries in 8.372 - Quantum Information 3 and was wondering about other applications of rotational symmetry), and I realized that there really is a very simple proof.

Let's do it

Theorem statement: Let $f:\sqrt{n}S^{n-1}\rightarrow \bC$ be an $L$ Lipschitz function.
$$\P_{X\sim \sqrt{n}S^{n-1}}[|f(X) - \E f|\geq \epsilon]\leq e^{-c \frac{\eps^2}{L^2}}$$

Here, the $\sqrt{n}$ radius is the right normalization for this to be analogous to the $n$ dimensional gaussian with identity covariance.

Note that we can take $f$ to be 1-Lipschitz and smooth WLOG (1 Lipschitz because we can just absorb a constant factor in the $\epsilon$, and smooth because 1-Lipschitz functions are smooth almost everywhere).

Proof idea.
I'm going to assume that we already know how things work in the linear case (which also turns out to be the worst case). That is, if $f$ is a linear function, the variable $f(X)$ looks essentially like a standard Gaussian with variance 1 (more formally, it's sub-Gaussian).

The idea is to use rotational symmetry to decompose the general problem into many small instances of the linear problem.

Proof.
We bound instead the quantity $\P[|f(X)-f(Y)|\geq \epsilon]$ for $X$ and $Y$ independent copies, which is equivalent up to some constant factors.

Next, we draw a geodesic curve between $X$ and $Y$, of length $\ell$. Let's call this curve $Z_t$ where $Z_0=X$ and $Z_\ell=Y$.
Observations

We proceed by splitting up this curve into $T$ tiny segments of length $dt=\ell/T=O\left( \frac{\sqrt{n}}{T} \right)$ (formally, we'll be taking $T$ to infinity) so that $f(Z_{t+dt})\approx f(Z_t) + \langle\nabla f(Z_t), Z_{t+dt}-Z_t\rangle$.
Let's think about what each of these differences
$$\langle\nabla f(Z_t), Z_{t+dt}-Z_t\rangle$$
looks like.

We have successfully reduced the problem to the linear case as promised!

The remainder will be a little heuristic to emphasize intuition (though should be easily formalizable).

The inner product from above is the projection of a random point on a sphere of radius $dt$ onto a unit direction. This is essentially going to be a Gaussian with standard deviation $\frac{1}{\sqrt{n}}dt = O\left( \frac{1}{T} \right)$. That is:
$$f(Z_{t+dt})-f(Z_t)\sim \calN\left( 0, \frac{1}{T^2} \right)$$
So that
$$f(Y)-f(X)=\sum_{i=0}^{T} f(Z_{(i+1)dt})-f(Z_{idt})$$
is the sum of $T$ Gaussians, each with variance order $\frac{1}{T^2}$, which is itself going to be a Gaussian of variance $\sigma^2\leq O(1)$. Therefore $$f(Y)-f(X)\sim \calN(0,O(1))$$
And we're basically done! We can just apply concentration of gaussian random variables to conclude that
$$\P[|f(X)-f(Y)|\geq \epsilon]\lesssim \exp\lbrb{c\epsilon^2}$$for some universal constant $c>0$ that one could figure out by going through these steps more carefully.

Thoughts/discussion

Followups/remaining questions