A first-principles proof of concentration of Lipschitz functions on the sphere (and the Gaussian) based on rotational symmetry
One of the most useful concentration inequalities I learned back in my probability class is the phenomenon of concentration of Lipschitz functions. For example, it can be used to show that, in understanding the maximum of a Gaussian process, the hard part is really just to characterize the expected supremum, and a high-probability bound just falls out. It's also used in a proof of the Johnson–Lindenstrauss lemma.
The proof we did in class relied, as a black box, on isoperimetric inequalities for the sphere and the Gaussian. (From now on, I'll stick to talking about the sphere.) Informally, this says that, fixing a (hyper)area, spherical caps are the subsets of the sphere with the smallest blowups ($\epsilon$-sized neighborhoods). If you fix your area to be half the sphere, this says that a hemisphere is the set which minimizes the size of a blowup. But it turns out that the size of this blowup is still really big: an $\epsilon$ blowup will take up something like $1-2e^{-c\epsilon^2}$ of the sphere. The crucial observation is that one can associate to a Lipschitz function $f$ its lower level set $L(f)$: the points $x$ such that $f(x)\leq M$ (where $M$ denotes the median of $f$). For a 1-Lipschitz $f$, the $\epsilon$ blowup of $L(f)$ is contained in the set where $f(x)\leq M+\epsilon$, so its size lower bounds the probability that $f(X) - M \leq \epsilon$. Since $M$ was taken to be the median, $L(f)$ has area at least half the sphere, allowing us to apply the isoperimetric inequality and blowup bound from before.
On the one hand, I really liked this approach: it gives a very clean conceptual picture, fitting well into the broader story of mixing, isoperimetry, and concentration.
On the other hand, it still feels a little unsatisfying, since it felt like we didn't actually prove anything; we just mapped one problem onto a pretty much equivalent one.
Recently my mind somehow circled back to this (I've been learning about random unitaries in 8.372 - Quantum Information 3 and was wondering about other applications of rotational symmetry), and I realized that there really is a very simple proof.
Let's do it
Theorem statement: Let $f:\sqrt{n}S^{n-1}\rightarrow \mathbb{R}$ be an $L$-Lipschitz function. Then
$$\P_{X\sim \sqrt{n}S^{n-1}}\left[|f(X) - \E f|\geq \epsilon\right]\leq e^{-c \frac{\epsilon^2}{L^2}}$$
Here, the $\sqrt{n}$ radius is the right normalization for this to be analogous to the $n$-dimensional Gaussian with identity covariance.
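As a quick numerical sanity check of the statement (a sketch, not part of the proof; it assumes numpy), we can take the 1-Lipschitz function $f(x)=\|x-p\|$ for a fixed point $p$ on the sphere and verify by Monte Carlo that the fluctuations of $f(X)$ are $O(1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 20000

# Uniform samples on the sphere of radius sqrt(n): normalize Gaussian vectors.
X = rng.standard_normal((m, n))
X *= np.sqrt(n) / np.linalg.norm(X, axis=1, keepdims=True)

# f(x) = ||x - p|| is 1-Lipschitz; p is an arbitrary fixed point on the sphere.
p = np.zeros(n)
p[0] = np.sqrt(n)
f = np.linalg.norm(X - p, axis=1)

# Although f(X) has mean of order sqrt(n), its fluctuations are O(1).
dev = f.std()
```

The standard deviation stays order 1 here, and stays order 1 if you crank up $n$, which is exactly the dimension-free concentration the theorem promises.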
Note that WLOG we can take $f$ to be 1-Lipschitz and smooth (1-Lipschitz because we can absorb the factor $L$ into $\epsilon$, and smooth because Lipschitz functions are differentiable almost everywhere by Rademacher's theorem, and can be approximated by smooth Lipschitz functions).
Proof idea.
I'm going to assume that we already know how things work in the linear case (which also turns out to be the worst case). That is, if $f$ is a linear function, the variable $f(X)$ looks essentially like a Gaussian with variance 1 (more formally, it's sub-Gaussian).
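The linear case is easy to check numerically. A small sketch (assuming numpy): the projection of a uniform point on $\sqrt{n}S^{n-1}$ onto a unit direction has mean $\approx 0$ and variance $\approx 1$, like a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 50000

# Uniform points on the sphere of radius sqrt(n).
X = rng.standard_normal((m, n))
X *= np.sqrt(n) / np.linalg.norm(X, axis=1, keepdims=True)

# A linear, 1-Lipschitz function: f(x) = <u, x> for a unit vector u.
u = np.zeros(n)
u[0] = 1.0
proj = X @ u

# Empirically, proj has mean ~0 and variance ~1, like a standard Gaussian.
mean, var = proj.mean(), proj.var()
```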
The idea is to use rotational symmetry to decompose the general problem into many small instances of the linear problem.
Proof.
We bound instead the quantity $\P[|f(X)-f(Y)|\geq \epsilon]$ for $X$ and $Y$ independent copies, which is equivalent up to some constant factors.
Next, we draw a geodesic curve between $X$ and $Y$, of length $\ell$. Let's call this curve $Z_t$ where $Z_0=X$ and $Z_\ell=Y$.
Observations
- The length $\ell$ of the curve $Z_t$ is order $\sqrt{n}$ (with high probability it'll be very close to $\frac{\pi}{2}\sqrt{n}$, but since we're going to be loose here with constants, we'll just upper bound this by $\pi \sqrt{n}$ which is the maximum geodesic distance between any two points on the sphere)
- Each $Z_t$ is marginally a uniform point on the sphere
- Also, each pair $(Z_t, Z_{t+dt})$ is marginally a uniform point together with a small step in a uniformly random (tangent) direction.
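These marginals can be checked numerically. Here's a sketch (assuming numpy) verifying that the geodesic midpoint of two independent uniform points is again marginally uniform on the sphere: its norm is exactly $\sqrt{n}$, and each coordinate has variance $\approx 1$, just like a fresh uniform sample:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 20000

def uniform_sphere(m, n, rng):
    """Uniform samples on the sphere of radius sqrt(n)."""
    Z = rng.standard_normal((m, n))
    return Z * (np.sqrt(n) / np.linalg.norm(Z, axis=1, keepdims=True))

X, Y = uniform_sphere(m, n, rng), uniform_sphere(m, n, rng)

# Geodesic midpoint: halfway along the great-circle arc from X to Y.
theta = np.arccos(np.clip(np.sum(X * Y, axis=1) / n, -1, 1))
mid = (np.sin(theta / 2)[:, None] * (X + Y)) / np.sin(theta)[:, None]

# Marginally the midpoint looks like a fresh uniform point:
norms = np.linalg.norm(mid, axis=1)   # all equal to sqrt(n)
coord_var = mid[:, 0].var()           # ~1, as for a uniform point
```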
We proceed by splitting up this curve into $T$ tiny segments of length $dt=\ell/T=O\left( \frac{\sqrt{n}}{T} \right)$ (formally, we'll be taking $T$ to infinity) so that $f(Z_{t+dt})\approx f(Z_t) + \langle\nabla f(Z_t), Z_{t+dt}-Z_t\rangle$.
Let's think about what each of these differences
$$\langle\nabla f(Z_t), Z_{t+dt}-Z_t\rangle$$
looks like.
- We know that $|\nabla f(Z_t)|_2\leq 1$ by Lipschitzness.
- By rotational symmetry, $Z_{t+dt}-Z_t$ is marginally going to look like a random vector of length $dt$.
We have successfully reduced the problem to the linear case as promised!
The remainder will be a little heuristic to emphasize intuition (though should be easily formalizable).
The inner product from above is the projection of a random point on a sphere of radius $dt$ onto a unit direction. This is essentially going to be a Gaussian with standard deviation $\frac{1}{\sqrt{n}}dt = O\left( \frac{1}{T} \right)$. That is:
$$f(Z_{t+dt})-f(Z_t)\sim \mathcal{N}\left( 0, \frac{1}{T^2} \right)$$
So that
$$f(Y)-f(X)=\sum_{i=0}^{T-1} \left[f(Z_{(i+1)dt})-f(Z_{i\,dt})\right]$$
is the sum of $T$ (correlated) Gaussians, each with variance of order $\frac{1}{T^2}$, i.e. standard deviation $O(1/T)$. Since standard deviations are subadditive, the sum is itself essentially a Gaussian with standard deviation at most $T\cdot O(1/T)=O(1)$, i.e. variance $\sigma^2\leq O(1)$. Therefore $$f(Y)-f(X)\sim \mathcal{N}(0,O(1))$$
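The point of summing standard deviations rather than variances is that the increments may be correlated, and perfect correlation is the worst case. A tiny numeric illustration (assuming numpy): $T$ increments of standard deviation $1/T$ sum to variance $1/T$ when independent, but to variance $1$ when perfectly correlated:

```python
import numpy as np

rng = np.random.default_rng(3)
T, m = 100, 50000

# T Gaussian increments, each with standard deviation 1/T (variance 1/T^2).
g = rng.standard_normal((m, T)) / T

indep_sum = g.sum(axis=1)   # independent increments
corr_sum = T * g[:, 0]      # perfectly correlated increments (T copies of one)

# Independent: variance T * (1/T^2) = 1/T, which vanishes as T grows.
# Perfectly correlated: standard deviation T * (1/T) = 1, the worst case.
var_indep, var_corr = indep_sum.var(), corr_sum.var()
```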
And we're basically done! We can just apply concentration of gaussian random variables to conclude that
$$\P[|f(X)-f(Y)|\geq \epsilon]\lesssim \exp\left(-c\epsilon^2\right)$$for some universal constant $c>0$ that one could figure out by going through these steps more carefully.
Thoughts/discussion
- This shows why linear functions maximize the probability $\P[|f(X)-f(Y)|\geq \epsilon]$: in that case, each of the $T$ mini-Gaussians we're summing is perfectly correlated with the others, which maximizes the variance of the sum.
- The same approach can be used to prove the Gaussian version of this theorem.
- $X$ and $Y$ will be independent samples from the Gaussian.
- $Z_t=\cos(t)X + \sin(t)Y$, so that marginally each $Z_t$ is a standard Gaussian, and each increment $Z_{t+dt}-Z_t$ is a Gaussian with per-coordinate variance $dt^2$
- Here $t$ ranges over $[0,\pi/2]$ and $dt=O(1/T)$, so once again we have $T$ Gaussian increments, each of variance $O(1/T^2)$
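The Gaussian interpolation above is easy to verify numerically (a sketch, assuming numpy): $Z_t=\cos(t)X+\sin(t)Y$ has identity covariance for every $t$, and a small step $Z_{t+dt}-Z_t$ has per-coordinate variance $\approx dt^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 50, 40000
t, dt = 0.7, 1e-3

X = rng.standard_normal((m, n))
Y = rng.standard_normal((m, n))

Z0 = np.cos(t) * X + np.sin(t) * Y            # cos^2 + sin^2 = 1, so still N(0, I)
Z1 = np.cos(t + dt) * X + np.sin(t + dt) * Y
step = Z1 - Z0

var_Z = Z0.var()                # ~1 per coordinate: Z_t is marginally standard
var_step = step.var() / dt**2   # ~1, i.e. the step has variance ~ dt^2
```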
Followups/remaining questions
- Can this approach be used to prove the isoperimetric inequality?
- I wonder if you can do something similar with the hypercube as well. Take two samples and look at the line between them; the direction of this line is like a random hypercube point.
- The issue is whether you can say that the gradient of $f$ at a point inside the hypercube is independent of the direction you're stepping in... which is definitely not true