In this lecture, we build intuition on geometric properties of bodies and random variables in high dimensions.
We start by discussing a striking (but not so hard to prove) property of high-dimensional objects: their volume is concentrated near the surface! You probably remember that the volume of a 3-dimensional cube with side $r$ is $r^3.$ In $d$ dimensions, the volume of a ($d$-dimensional, hyper)cube with side $r$ is $r^d$. Let us now consider what happens when we shrink the cube by scaling each side by a factor $1 - \epsilon,$ for some $\epsilon \in (0, 1).$ Since the new side of the cube is $(1 - \epsilon)r,$ the new volume is
$$ ((1 - \epsilon) r)^d = (1-\epsilon)^d r^d.$$Thus, the volume of the cube scaled by a factor $1-\epsilon$ is smaller by (an exponential!) factor $(1 - \epsilon)^d$. If we take $\epsilon = t/d$ for some constant $ t \in (1, d),$ this tells us that only a $(1 - t/d)^d \leq \exp(-t)$ fraction of the volume is contained "away from the boundary," in the inscribed cube of side $(1-t/d)r.$ Equivalently, most of the volume of the high-dimensional cube lies near the boundary, in a shell whose width is of the order $r/d$ (for constant $t$). As a specific example, for $t=10$ (and $d > 10$), more than a $1 - \exp(-10) \approx 0.99995 = 99.995\%$ fraction of the volume lies outside the inscribed cube of side $(1-10/d)r,$ that is, within relative distance $10/d$ of the boundary.
import numpy as np
t = 10 # the shrinking parameter: the inscribed cube has side (1 - t/d) * r
1 - np.exp(-t) # lower bound on the fraction of the volume lying outside the inscribed cube
0.9999546000702375
This same argument can be used to establish that the same property $-$ that the volume concentrates close to the boundary $-$ holds for arbitrary high-dimensional geometric bodies. To see this, observe that scaling an arbitrary body by a factor $1-\epsilon$ scales its volume by $(1-\epsilon)^d$; one way to see this is to partition the body into infinitesimal cubes, as you would do when computing Riemann integrals, and apply the reasoning we used for the high-dimensional cube to each of them. Thus, a copy of the body shrunk by a factor $1-\epsilon$ captures only a $(1-\epsilon)^d$ fraction of the volume, and the rest of the volume lies in the thin region near the boundary. As a consequence, we can conclude that if a high-dimensional vector is generated uniformly at random from a geometric body of the same dimension, then with high probability it will lie near the boundary. (Here, "uniform" means using the probability density function equal to the inverse of the volume of the body inside the body and equal to zero outside.) In particular, in high dimensions, points generated uniformly at random from a ball of radius $r$ $-$ the set of all vectors $\mathbf{x} = [x_1, x_2, \dots, x_d]^\top$ such that $\|\mathbf{x}\|_2^2 = x_1^2 + x_2^2 + \dots + x_d^2 \leq r^2$ $-$ concentrate near the surface of the ball, in the sense that a point generated uniformly at random from the ball has distance from the center between $(1 - 10/d)r$ and $r$ with probability at least $1 - e^{-10} \approx 0.99995$.
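To see this concretely, here is a small numerical experiment (our own addition; the dimension, sample size, seed, and the choice $t = 3$ are arbitrary). It draws points uniformly at random from the unit ball in $d = 500$ dimensions, using the standard recipe of a uniformly random direction scaled by a radius distributed as $U^{1/d}$ with $U$ uniform on $[0, 1],$ and estimates the fraction of points that land within relative distance $t/d$ of the boundary; by the reasoning above, this fraction should be at least $1 - e^{-3} \approx 0.95.$
rng = np.random.default_rng(0) # random generator with a fixed seed, so the experiment is reproducible
d, n_samples, t = 500, 10_000, 3
# Uniform samples from the unit ball: a uniformly random direction times a radius distributed as U**(1/d)
g = rng.standard_normal((n_samples, d))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)
radii = rng.uniform(size=(n_samples, 1)) ** (1 / d)
points = directions * radii
# Fraction of points within relative distance t/d of the boundary, compared to the 1 - e^{-t} lower bound
near_boundary = np.mean(np.linalg.norm(points, axis=1) >= 1 - t / d)
print(near_boundary, 1 - np.exp(-t))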
You are probably familiar with the "bell-shaped" curves corresponding to Gaussian random variables in one dimension (we have also seen an illustration in the first lecture note). What these curves tell us is that in one dimension, randomly generated points from a Gaussian distribution will mostly lie close around the "mode" (in this case also the mean) of the distribution. Let us see an illustration of that below, where we plot the pdf of the standard normal distribution $\mathcal{N}(0, 1)$ and 50 randomly generated points from this distribution.
from scipy.stats import norm # this is imported to plot the pdf of normal distribution
import matplotlib.pyplot as plt
%matplotlib inline
n = 50 # the number of randomly generated points
r = norm.rvs(size = n) # random samples from the standard normal distribution
x = np.linspace(-5, 5, 100) # choose the x-axis values for the plot
# Plot
plt.plot(x, norm.pdf(x, 0, 1)) # plot the pdf
plt.scatter(r, np.zeros(n), marker='x', color='red') # plot the randomly generated points
plt.show()
However, as we increase the dimension of a Gaussian random variable, its geometric properties change dramatically. As the dimension $d$ of a standard normal high-dimensional random vector $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ increases, fewer randomly generated points (vectors) from this distribution are likely to land close to the mean. Instead, the random vectors concentrate strongly around the sphere of radius $\sqrt{d}.$ (Recall also that standard normal vectors are spherically symmetric $-$ the density depends only on the magnitude of the vector, so all directions are equally likely.) In that sense, high-dimensional standard normal random vectors behave similarly to vectors drawn uniformly at random from a high-dimensional sphere or ball (both of radius $\sqrt{d}$), which is the reason why all three of these distributions feature prominently in the analysis of randomized algorithms, with the analysis often being interchangeable among the three. But we are digressing... Let us now formally argue what we claimed above: that high-dimensional (standard normal) Gaussians concentrate around the sphere of radius $\sqrt{d}.$
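Before arguing this formally, here is a quick numerical illustration (our own addition, with arbitrarily chosen dimensions and sample size): we draw 10,000 standard normal vectors in several dimensions and compare their norms to $\sqrt{d}$.
rng = np.random.default_rng(1)
for d in [1, 10, 100, 1000]:
    norms = np.linalg.norm(rng.standard_normal((10_000, d)), axis=1) # norms of 10,000 samples from N(0, I)
    print(f"d = {d}: sqrt(d) = {np.sqrt(d):.2f}, mean norm = {norms.mean():.2f}, std of norm = {norms.std():.3f}")
The mean norm tracks $\sqrt{d},$ while the spread around it stays of constant order (approximately $1/\sqrt{2}$ for large $d$), which is exactly the kind of concentration we quantify next.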
Recall the definition of the Euclidean (or $\ell_2$) norm of a vector: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^d x_i^2},$ where $x_1, x_2, \dots, x_d$ are the entries of $\mathbf{x}$; as we discussed before, this norm corresponds to what we intuitively understand as the length of a vector. We prove the following:
Proof To prove the theorem, we use the Chernoff method. You will recall that the Chernoff method boils down to finding (or bounding) the moment generating function of a random variable and using Markov's inequality. The question here is: which random variable should we work with?
The somewhat obvious choice is to try to directly bound $\|\mathbf{z}\|_2$, but it is not entirely clear how to do that. However, if we square it, then we get a sum of squares of independent univariate Gaussians: $\|\mathbf{z}\|_2^2 = z_1^2 + z_2^2 + \dots + z_d^2,$ where each $z_i \sim \mathcal{N}(0, 1).$ Observe that $\mathbb{E}[z_i^2] = 1,$ because $z_i \sim \mathcal{N}(0, 1),$ for all $i = 1, 2, \dots, d,$ and it helps to center (make expectation equal to zero) each $z_i^2$ by considering $z_i^2 - 1$. Now we can use independence to conclude that
$$ \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}[e^{\lambda (\sum_{i=1}^d(z_i^2 - 1))}] = \mathbb{E}_{z_1 \sim \mathcal{N}(0, 1), \dots, z_d \sim \mathcal{N}(0, 1) }[e^{\lambda (z_1^2-1)}e^{\lambda (z_2^2 - 1)}\cdots e^{\lambda (z_d^2-1)}] = \Big(\mathbb{E}_{z_1 \sim \mathcal{N}(0, 1)}[e^{\lambda (z_1^2-1)}] \Big)^d. $$Now we can focus on finding (or bounding) $\mathbb{E}_{z_1 \sim \mathcal{N}(0, 1)}[e^{\lambda (z_1^2-1)}]$. Writing this expectation explicitly by recalling the formula for the pdf of a standard normal random variable, we have
\begin{align} \mathbb{E}_{z_1 \sim \mathcal{N}(0, 1)}[e^{\lambda (z_1^2 - 1)}] &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{\lambda (z_1^2 - 1)} e^{-\frac{z_1^2}{2}}dz_1 \\ &= e^{-\lambda}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{+\infty} e^{-\frac{z_1^2(1 - 2\lambda)}{2}}dz_1. \end{align}If $1 - 2\lambda > 0$, then $\frac{\sqrt{1-2\lambda}}{\sqrt{2\pi}} e^{-\frac{z_1^2(1 - 2\lambda)}{2}}$ is the pdf of a Gaussian random variable with mean zero and variance $1/(1-2\lambda).$ If $1 - 2\lambda \leq 0,$ the above integral is infinite. Thus, we conclude that for $\lambda < \frac{1}{2},$ we have $\mathbb{E}_{z_1 \sim \mathcal{N}(0, 1)}[e^{\lambda (z_1^2-1)}] = \frac{e^{-\lambda}}{\sqrt{1-2\lambda}}.$ We could now try applying Markov inequality to $e^{\lambda (\|\mathbf{z}\|_2^2-d)}$ for $\lambda \in (0, 1/2),$ using
$$ \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}[e^{\lambda (\|\mathbf{z}\|_2^2-d)}] = \frac{e^{-\lambda d}}{(1-2\lambda)^{d/2}}. $$However, this would only allow us to argue that $\|\mathbf{z}\|_2 \leq c \sqrt{d}$ with high probability for some constant $c > 1,$ which is weaker than what we are trying to prove. (You are encouraged to try this on your own.)
A key insight to prove the stated result is the following inequality that we state here without proof:
$$ \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \leq e^{2\lambda^2}, \quad \forall \lambda: |\lambda|\leq 1/4. $$Thus, we can apply Markov's inequality to the random variable $e^{\lambda (\|\mathbf{z}\|_2^2 - d)}$ (in the same way we have used the Chernoff method), but now we need to ensure that $|\lambda| \leq 1/4.$ Note that by what we have shown above, we have that for $|\lambda| \leq 1/4,$
$$ \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}[e^{\lambda (\|\mathbf{z}\|_2^2-d)}] \leq e^{2d\lambda^2}. $$Before proceeding, recall (from the statement of the theorem) that we are interested in bounding the probability that $|\|\mathbf{z}\|_2 - \sqrt{d}| \geq t$ for $t \in (0, \sqrt{d}]$. Multiplying both sides by $\|\mathbf{z}\|_2 + \sqrt{d},$ this is the same as bounding the probability that $|\|\mathbf{z}\|_2^2 - {d}| \geq t(\|\mathbf{z}\|_2 + \sqrt{d}).$ Because $\|\mathbf{z}\|_2 + \sqrt{d} \geq \sqrt{d},$ this event implies that $|\|\mathbf{z}\|_2^2 - {d}| \geq t\sqrt{d},$ so it suffices to bound above the probability that $|\|\mathbf{z}\|_2^2 - {d}| \geq t\sqrt{d}.$
It remains now to apply Markov's inequality to the events $e^{\lambda(\|\mathbf{z}\|_2^2 - d)} \geq e^{\lambda t\sqrt{d}}$ and $e^{-\lambda(\|\mathbf{z}\|_2^2 - d)} \geq e^{\lambda t\sqrt{d}}$ for $\lambda \in (0, 1/4]$ to bound the probabilities that $\|\mathbf{z}\|_2^2 - d \geq t\sqrt{d}$ and $-(\|\mathbf{z}\|_2^2 - d) \geq t\sqrt{d}$, respectively, and obtain a two-sided tail bound. In particular, we have that
\begin{align} \mathbb{P}[|\|\mathbf{z}\|_2^2 - {d}| \geq t\sqrt{d}] &= \mathbb{P}[e^{\lambda(\|\mathbf{z}\|_2^2 - d)} \geq e^{\lambda t\sqrt{d}}] + \mathbb{P}[e^{-\lambda(\|\mathbf{z}\|_2^2 - d)} \geq e^{\lambda t\sqrt{d}}] \\ &\leq 2 e^{2d\lambda^2 - \lambda t\sqrt{d}}. \end{align}As we saw before, the above bound is minimized when $2d\lambda^2 - \lambda t\sqrt{d}$ is minimized, which happens for $\lambda = \frac{t}{4\sqrt{d}}.$ Since $t \in (0, \sqrt{d}],$ this choice also guarantees that $\lambda \in (0, 1/4],$ so the above bound applies. Plugging in this choice, we get
$$ \mathbb{P}[|\|\mathbf{z}\|_2^2 - {d}| \geq t\sqrt{d}] \leq 2 e^{-t^2/8}, $$which is exactly what was claimed in the theorem statement. $\blacksquare$
In the above theorem, we used but did not prove the inequality $\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \leq e^{2\lambda^2}$ for $|\lambda| \leq 1/4. $ Although this is not a proof, we can use Python to plot the function $\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} - e^{2\lambda^2}$ on a range slightly wider than $|\lambda| \leq 1/4$ and convince ourselves that the inequality we used is correct.
lambda_val = np.linspace(-1/3, 1/3, 100) # lambda values on a range slightly wider than the |lambda| <= 1/4 used in the proof
# Calculate corresponding function values for the inequality
f_lambda = np.exp(-lambda_val) / np.sqrt(1 - 2 * lambda_val) - np.exp(2 * lambda_val**2)
# Plot the function
plt.plot(lambda_val, f_lambda)
plt.xlabel(r'$\lambda$')
plt.ylabel(r'$\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} - e^{2\lambda^2}$')
plt.grid(True)
plt.show()
As a sanity check, we can also calculate the maximum of the plotted values, which should indeed be negative.
print(np.max(f_lambda))
-1.1285751112311004e-05
We now discuss another surprising fact about Gaussians: we can generate exponentially many (in the dimension $d$) independent Gaussian random vectors and guarantee, with high probability, that they are all near-orthogonal (recall that in a $d$-dimensional space there can be at most $d$ nonzero vectors that are mutually exactly orthogonal). To argue this, let us first remind ourselves that for two (nonzero) vectors $\mathbf{x}$ and $\mathbf{y},$ the cosine of the angle between them, which we denote by $\alpha(\mathbf{x}, \mathbf{y}),$ satisfies:
$$ \cos(\alpha(\mathbf{x}, \mathbf{y})) = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}. $$Thus, two vectors are orthogonal if $\mathbf{x}^\top \mathbf{y} = 0.$ We will say that they are near-orthogonal when $\frac{|\mathbf{x}^\top \mathbf{y}|}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}$ is small, which aligns with the geometric intuition (note that two vectors span a space of dimension at most two $-$ a plane $-$ so it suffices to understand the angle between them by looking at things in 2D).
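For concreteness, here is a one-line NumPy helper computing this quantity (our own snippet; the name cos_angle is just a convenient choice):
def cos_angle(x, y):
    # cosine of the angle between two nonzero vectors x and y
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_angle(np.array([1.0, 0.0]), np.array([0.0, 1.0]))) # orthogonal vectors: prints 0.0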
Let us first consider what happens when we consider an angle between a random Gaussian vector $\mathbf{z}\sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and an arbitrary but fixed vector $\mathbf{x}$ of length $l = \|\mathbf{x}\|_2.$ We will show below that they are near-orthogonal with high probability.
Proof We use Chernoff's method again to prove this lemma. To do so, we compute $\mathbb{E}[e^{\lambda \mathbf{z}^\top \mathbf{x}}].$ As before, we notice that the entries of $\mathbf{z}$ are independent, and using that $\mathbf{z}^\top \mathbf{x} = \sum_{i=1}^d z_i x_i,$ we can write
$$ \mathbb{E}[e^{\lambda \mathbf{z}^\top \mathbf{x}}] = \mathbb{E}[e^{\lambda z_1 x_1}e^{\lambda z_2 x_2}\cdots e^{\lambda z_d x_d}] = \mathbb{E}[e^{\lambda z_1 x_1}]\mathbb{E}[e^{\lambda z_2 x_2}]\cdots \mathbb{E}[e^{\lambda z_d x_d}]. $$We have already computed the MGF of a univariate standard Gaussian before, so to avoid repetition, we observe that in this case instead of being multiplied by $\lambda,$ each Gaussian $z_i$ for $i = 1, 2, \dots, d$ is multiplied by $\lambda x_i,$ which is just another constant as all entries $x_i$ are fixed. Thus, we can conclude that
$$ \mathbb{E}[e^{\lambda z_i x_i}] = e^{\frac{\lambda^2 x_i^2}{2}}, \;\; \forall i = 1, 2, \dots, d, $$and, as a result, we can further conclude that
$$ \mathbb{E}[e^{\lambda \mathbf{z}^\top \mathbf{x}}] = e^{\frac{\lambda^2 (x_1^2 + x_2^2 + \dots + x_d^2)}{2}} = e^{\frac{\lambda^2 l^2}{2}}, $$where we used that $\|\mathbf{x}\|_2^2 = x_1^2 + x_2^2 + \dots + x_d^2 = l^2,$ which holds by the lemma assumption. We can conclude that $\mathbf{z}^\top \mathbf{x}$ is $l^2$-subgaussian and apply the Chernoff bound we had derived before.
In particular, we apply Markov's inequality to the events $e^{\lambda \mathbf{z}^\top \mathbf{x}} \geq e^{\lambda t}$ and $e^{-\lambda \mathbf{z}^\top \mathbf{x}} \geq e^{\lambda t}$ for $\lambda > 0$ and then minimize the resulting bound over $\lambda,$ as we have done before. This gives
$$ \mathbb{P}[|\mathbf{z}^\top \mathbf{x}| \geq t] = \mathbb{P}[e^{\lambda \mathbf{z}^\top \mathbf{x}} \geq e^{\lambda t}] + \mathbb{P}[e^{-\lambda \mathbf{z}^\top \mathbf{x}} \geq e^{\lambda t}] \leq 2 e^{\frac{\lambda^2 l^2}{2} - \lambda t}. $$We have seen before (by using the derivative test) that $\lambda = \frac{t}{l^2}$ minimizes the above bound, and plugging in this choice, we thus conclude that
$$ \mathbb{P}[|\mathbf{z}^\top \mathbf{x}| \geq t] \leq 2 e^{- \frac{t^2}{2l^2}}, $$as was claimed. $\blacksquare$
Observe also that $\mathbf{z}^\top \mathbf{x}$ is a univariate Gaussian with mean $\mu = 0$ and variance $\sigma^2 = l^2$ (how would you conclude this?). Thus, we could have also proved the lemma above by directly applying the concentration bound for univariate Gaussians that we proved in the previous lecture.
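As a quick numerical check of this observation (our own snippet; the fixed vector $\mathbf{x}$ below is an arbitrary choice), the sample mean and standard deviation of $\mathbf{z}^\top \mathbf{x}$ over many independent draws of $\mathbf{z}$ should be close to $0$ and $\|\mathbf{x}\|_2,$ respectively.
rng = np.random.default_rng(2)
d = 500
x = rng.uniform(-1, 1, size=d) # an arbitrary fixed vector x
z = rng.standard_normal((20_000, d)) # 20,000 independent draws of z ~ N(0, I)
dots = z @ x # the inner products z^T x
print(np.linalg.norm(x), dots.std()) # the sample standard deviation should be close to ||x||_2
print(dots.mean()) # the sample mean should be close to 0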
Here, we are actually interested in bounding the probability that the cosine of the angle between two independent Gaussian vectors is large in absolute value. More specifically, we want to bound the probability $\mathbb{P}\big[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t\big]$ for $t > 0.$ First, we know from what we saw earlier in this lecture that Gaussians concentrate strongly around the sphere of radius $\sqrt{d}.$ As a result, we can argue that $\mathbb{P}\big[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t\big] \approx \mathbb{P}[{|\mathbf{z}_1^\top \mathbf{z}_2|} \geq t d]$. To bound the probability $\mathbb{P}[{|\mathbf{z}_1^\top \mathbf{z}_2|} \geq t d]$, we can either condition on one of the two vectors to keep it fixed or we can compute the MGF of $\mathbf{z}_1^\top \mathbf{z}_2$ and apply the Chernoff method again. We discuss briefly here how we would apply the Chernoff method. We observe first that $\mathbf{z}_1^\top \mathbf{z}_2$ is the sum of the products of $d$ pairs of independent standard univariate Gaussians. The MGF of the product of two independent standard Gaussians is $\frac{1}{\sqrt{1- \lambda^2}}$ for $|\lambda|<1$ (prove this as an exercise). Thus, using independence across the $d$ coordinates, we can argue that for $|\lambda|<1$,
$$ \mathbb{E}[e^{\lambda \mathbf{z}_1^\top \mathbf{z}_2}] = \frac{1}{(1 - \lambda^2)^{d/2}}. $$We won't prove this, but it is possible to show (and you can check it visually by plotting in Python, as we did before) that $1 - \lambda^2 \geq e^{-2\lambda^2}$ for $|\lambda| \leq 0.8.$ As a consequence, we can conclude that for $|\lambda| \leq 0.8,$
$$ \mathbb{E}[e^{\lambda \mathbf{z}_1^\top \mathbf{z}_2}] \leq e^{\lambda^2 d}. $$In other words, $\mathbf{z}_1^\top \mathbf{z}_2$ is $2d$-subgaussian for sufficiently small $\lambda$. Thus, we can immediately conclude (applying the Chernoff bound we derived for subgaussian random variables or re-deriving the bound using the Chernoff method) that
As a specific example, if we take $t = \frac{10}{\sqrt{d}}$ (assuming here that the dimension is sufficiently high; e.g., at least a hundred), then the above bound on the probability is $2e^{-25} \approx 2.8 \cdot 10^{-11}$.
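As with the earlier inequality, we can use Python to check numerically (though, again, this is not a proof) that $1 - \lambda^2 \geq e^{-2\lambda^2}$ for $|\lambda| \leq 0.8,$ and to evaluate the number quoted above.
# Check 1 - lambda^2 >= exp(-2 * lambda^2) on a grid covering |lambda| <= 0.8
lam = np.linspace(-0.8, 0.8, 1001)
print(np.max(np.exp(-2 * lam**2) - (1 - lam**2))) # should be at most 0 (with equality at lambda = 0)
# The tail bound evaluated at t = 10 / sqrt(d) equals 2 * exp(-25), independently of d
print(2 * np.exp(-25))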
For completeness, let us also argue about the probability that $\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t$. Here, the trick is to use the fact that both $\mathbf{z}_1$ and $\mathbf{z}_2$ concentrate strongly on a sphere of radius $\sqrt{d}.$
Proof Consider the random event $\mathcal{E} = \{\|\mathbf{z}_1\|_2 > \sqrt{d/2} \text{ and } \|\mathbf{z}_2\|_2 > \sqrt{d/2}\}$. The complement of this event is $\mathcal{E}^c = \{\|\mathbf{z}_1\|_2 \leq \sqrt{d/2} \text{ or } \|\mathbf{z}_2\|_2 \leq \sqrt{d/2}\}$. Recall that the union bound tells us that the probability that at least one of two (not necessarily independent) events occurs is at most the sum of the probabilities of the two events. In particular, we have that
$$ \mathbb{P}[\mathcal{E}^c] \leq \mathbb{P}[\|\mathbf{z}_1\|_2 \leq \sqrt{d/2}] + \mathbb{P}[\|\mathbf{z}_2\|_2 \leq \sqrt{d/2}] \leq 2 e^{-\frac{(\sqrt{d/2})^2}{8}} = 2e^{-\frac{d}{16}}, $$where we used the one-sided version of the "Gaussian annulus theorem" from the beginning of the lecture. In other words, when the dimension $d$ is high, the event $\mathcal{E}^c$ is highly unlikely.
To bound the probability from the statement of the lemma, we split the event according to whether $\mathcal{E}$ or $\mathcal{E}^c$ occurs, which gives
\begin{align} \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t\bigg] &= \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t \text{ and } \mathcal{E}\bigg] + \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t \text{ and } \mathcal{E}^c\bigg]\\ &\leq \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t \text{ and } \mathcal{E}\bigg] + \mathbb{P}[\mathcal{E}^c], \end{align}where we have used the fact that the probability of the intersection of two events is at most the probability of either one of them. We have already bounded $\mathbb{P}[\mathcal{E}^c],$ so it remains to bound the first term. Because the event $\mathcal{E}$ posits that both $\mathbf{z}_1$ and $\mathbf{z}_2$ have magnitude greater than $\sqrt{d/2},$ so that $\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2 \geq d/2,$ we immediately conclude that
$$ \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t \text{ and } \mathcal{E}\bigg] \leq \mathbb{P}\big[{|\mathbf{z}_1^\top \mathbf{z}_2|} \geq t d/2\big]. $$Based on the lemma we proved above, we know that for $t \in (0, 3.2]$ (or, equivalently, $t/2 \in (0, 1.6]$) we have
$$ \mathbb{P}\big[{|\mathbf{z}_1^\top \mathbf{z}_2|} \geq t d/2\big] \leq 2 e^{-\frac{t^2 d}{16}}. $$Thus, we can finally conclude that for any $t \in (0, 3.2]$,
$$ \mathbb{P}\bigg[\frac{|\mathbf{z}_1^\top \mathbf{z}_2|}{\|\mathbf{z}_1\|_2 \|\mathbf{z}_2\|_2} \geq t\bigg] \leq 2 e^{-\frac{d}{16}} + 2 e^{-\frac{t^2 d}{16}}. $$For $t \leq 1,$ the second term dominates and we reach the conclusion of the lemma. $\blacksquare$
Let us now discuss some immediate consequences of the lemma we have just proved. First, due to the spherical symmetry of Gaussians (recall our first lecture note), the vectors $\frac{\mathbf{z}_1}{\|\mathbf{z}_1\|_2}$ and $\frac{\mathbf{z}_2}{\|\mathbf{z}_2\|_2}$ are distributed uniformly on the unit sphere. Hence we can also conclude that vectors sampled uniformly at random from a high-dimensional unit sphere are near-orthogonal with high probability. A similar conclusion holds for random vectors drawn from the unit ball, as, based on what we discussed at the beginning of this lecture, the volume of the ball in high dimensions concentrates strongly near the enclosing sphere.
Now consider generating $n$ independent Gaussian vectors $\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_n$ in a high dimension $d$ and ask how large $n$ can be while keeping all the vectors near-orthogonal to each other. Let $t \in (0, 1]$ be any constant, possibly very close to (but strictly greater than) zero. The lemma above bounds the probability that a given pair of vectors $\mathbf{z}_i, \mathbf{z}_j$ is not near-orthogonal. Because there are ${{n}\choose{2}} = \frac{n(n-1)}{2}$ possible vector pairs, the union bound tells us that
$$ \mathbb{P}\bigg[\text{there exists a vector pair } \mathbf{z}_i, \mathbf{z}_j \text{ s.t. } \frac{|\mathbf{z}_i^\top \mathbf{z}_j|}{\|\mathbf{z}_i\|_2 \|\mathbf{z}_j\|_2} \geq t \bigg] \leq \frac{n(n-1)}{2}\cdot 4 e^{-\frac{t^2 d}{16}} \leq 2 n^2 e^{-\frac{t^2 d}{16}}. $$Pick $\delta > 0$ to be some small number, like $0.01$. Based on the above bound, we can now conclude that with probability at least $1 - \delta$, for any $t \in (0, 1],$ up to $\sqrt{\frac{\delta}{2}}e^{\frac{t^2 d}{32}}$ randomly generated Gaussian vectors are all near-orthogonal, in the sense that $\frac{|\mathbf{z}_i^\top \mathbf{z}_j|}{\|\mathbf{z}_i\|_2 \|\mathbf{z}_j\|_2} < t$ for all $i \neq j$ in $\{1, 2, \dots, n\}$. In particular, treating $\delta$ and $t$ as constants independent of the dimension, we conclude that, with high probability, up to exponentially many (in $d$) independent Gaussian random vectors are all near-orthogonal.
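To see this numerically, here is a small experiment (our own addition, with arbitrarily chosen $n$ and $d$): we draw $n = 500$ standard Gaussian vectors in $d = 5000$ dimensions and look at the largest cosine of an angle (in absolute value) over all pairs.
rng = np.random.default_rng(3)
n, d = 500, 5000
Z = rng.standard_normal((n, d)) # n independent draws from N(0, I) in dimension d
Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True) # normalize each vector to unit length
C = np.abs(Zn @ Zn.T) # |cosines| of the angles between all pairs of vectors
np.fill_diagonal(C, 0.0) # ignore the trivial pairs (i, i)
print(C.max()) # the largest |cosine| over all pairs; it should be a small number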
In many data science applications we are often faced with datasets for which the dimension of the data (the number of features) is extremely large. This is particularly true for datasets encountered in biology, medicine, and genetics. For such datasets, it is often impractical to store and perform computation on the entire dataset, due to its size. Fortunately, there is a simple dimension-reduction procedure that allows us to store and manipulate much lower-dimensional data vectors while preserving the information about their pairwise distances. The information about pairwise distances is sufficient when all we are interested in is identifying similarities/dissimilarities between data vectors. For example, when trying to determine ancestry relationships from genetic data, it may be sufficient to determine which groups of data vectors are the most similar ("closest") to each other.
Suppose that we are given a dataset consisting of $n$ $d$-dimensional vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_n.$ The basic idea that we use here comes from the properties of high-dimensional Gaussian random vectors (drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I})$). In particular, let $\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_k$ be independently drawn $d$-dimensional vectors from $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Given a data vector $\mathbf{v},$ we compute and store the following $k$-dimensional vector:
$$ \pi(\mathbf{v}) = \frac{1}{\sqrt{k}}[\mathbf{u}_1^\top \mathbf{v}, \mathbf{u}_2^\top \mathbf{v}, \dots, \mathbf{u}_k^\top \mathbf{v}]^\top. $$We argue that up to a small error $\epsilon > 0$ that we choose, for a sufficiently large (but not too large) value of $k,$ the mapped vector $\pi(\mathbf{v})$ preserves the length of $\mathbf{v}$ as well as the pairwise distances between data vectors $\mathbf{v}_1, \mathbf{v}_2,\dots, \mathbf{v}_n.$
Observe that for a fixed (nonzero) vector $\mathbf{v},$ each $\mathbf{u}_i^\top \mathbf{v} = \sum_{j=1}^d u_{i, j} v_j$ for $i = 1, 2, \dots, k$ is a Gaussian random variable with mean zero and variance equal to $\|\mathbf{v}\|_2^2$ (argue this using what we have learned about Gaussian random variables). Because the vectors $\mathbf{u}_i$ for $i = 1, 2, \dots, k$ are independent, we can conclude that $\frac{\sqrt{k}}{\|\mathbf{v}\|_2} \pi(\mathbf{v}) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$ Applying the Gaussian Annulus Theorem that we proved earlier in this lecture (now in dimension $k$, so with $t \in (0, \sqrt{k}]$), we have that
$$ \mathbb{P}\bigg[\Big|\frac{\sqrt{k}}{\|\mathbf{v}\|_2} \|\pi(\mathbf{v})\|_2 - \sqrt{k}\Big| \geq t\bigg] \leq 2 e^{-\frac{t^2}{8}}. $$The above probability is the same as $\mathbb{P}\bigg[\Big|\|\pi(\mathbf{v})\|_2 - \|\mathbf{v}\|_2 \Big| \geq \frac{t \|\mathbf{v}\|_2}{\sqrt{k}}\bigg]$. Thus, setting $t' = \frac{t}{\sqrt{k}},$ we can conclude that
$$ \mathbb{P}\bigg[\Big|\|\pi(\mathbf{v})\|_2 - \|\mathbf{v}\|_2 \Big| \geq t' \|\mathbf{v}\|_2\bigg] \leq 2 e^{-\frac{(t')^2 k}{8}}. $$The Johnson-Lindenstrauss Lemma is based on this argument: it applies it to all pairwise distances $\|\mathbf{v}_i - \mathbf{v}_j\|_2$ for $1 \leq i < j \leq n$ (all ${{n}\choose{2}}$ of them). The key observation is that the (projection) mapping $\pi$ is linear and thus satisfies $\pi(\mathbf{v}_i) - \pi(\mathbf{v}_j) = \pi(\mathbf{v}_i - \mathbf{v}_j),$ so we can use the above reasoning for all vectors $\mathbf{v} = \mathbf{v}_i - \mathbf{v}_j$ and apply a union bound.
Proof Observe that the statement of the lemma can be equivalently written as
$$ \mathbb{P}\bigg[\exists i, j \in \{1, 2, \dots, n\} \text{ s.t. } \Big| \|\pi(\mathbf{v}_i) - \pi(\mathbf{v}_j)\|_2 - \|\mathbf{v}_i - \mathbf{v}_j\|_2\Big| \geq \epsilon \|\mathbf{v}_i - \mathbf{v}_j\|_2\bigg] \leq \delta. $$For any fixed vector pair $\mathbf{v}_i, \mathbf{v}_j$ (setting $\mathbf{v}= \mathbf{v}_i - \mathbf{v}_j$ and $t' = \epsilon$) we have already argued above that
$$ \mathbb{P}\bigg[\Big| \|\pi(\mathbf{v}_i) - \pi(\mathbf{v}_j)\|_2 - \|\mathbf{v}_i - \mathbf{v}_j\|_2\Big| \geq \epsilon \|\mathbf{v}_i - \mathbf{v}_j\|_2\bigg] \leq 2 e^{- \frac{\epsilon^2 k}{8}}. $$Since there are ${{n}\choose{2}} = \frac{n(n-1)}{2}$ possible $\mathbf{v}_i, \mathbf{v}_j$ pairs, applying the union bound we conclude that
$$ \mathbb{P}\bigg[\exists i, j \in \{1, 2, \dots, n\} \text{ s.t. } \Big| \|\pi(\mathbf{v}_i) - \pi(\mathbf{v}_j)\|_2 - \|\mathbf{v}_i - \mathbf{v}_j\|_2\Big| \geq \epsilon \|\mathbf{v}_i - \mathbf{v}_j\|_2\bigg] \leq \frac{n(n-1)}{2} 2 e^{- \frac{\epsilon^2 k}{8}} < n^2 e^{- \frac{\epsilon^2 k}{8}}. $$To reach the conclusion of the lemma, it remains to ensure that $k$ is sufficiently large so that the above probability is bounded by $\delta$. In particular, solving
$$ n^2 e^{- \frac{\epsilon^2 k}{8}} \leq \delta $$for $k$, we get
$$ k \geq \frac{8}{\epsilon^2}\log\Big(\frac{n^2}{\delta}\Big) $$and thus choosing $k = \left \lceil \frac{16}{\epsilon^2}\log\big(\frac{n}{\delta}\big)\right\rceil$ suffices. $\blacksquare$
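To make the construction concrete, here is a minimal sketch of the random projection $\pi$ (our own code; the values of $n$, $d$, $\epsilon$, $\delta$, and the synthetic data are arbitrary choices). It projects the data to $k$ dimensions as in the lemma and compares a few pairwise distances before and after the projection.
rng = np.random.default_rng(4)
n, d, eps, delta = 100, 5000, 0.5, 0.05
k = int(np.ceil(16 / eps**2 * np.log(n / delta))) # reduced dimension, as chosen in the lemma
V = rng.uniform(-1, 1, size=(n, d)) # arbitrary synthetic data vectors v_1, ..., v_n
U = rng.standard_normal((k, d)) # k independent projection vectors u_1, ..., u_k drawn from N(0, I)
P = V @ U.T / np.sqrt(k) # row i of P is pi(v_i) = (1/sqrt(k)) [u_1^T v_i, ..., u_k^T v_i]
print("reduced dimension k =", k)
for i, j in [(0, 1), (2, 3), (4, 5)]:
    print(np.linalg.norm(V[i] - V[j]), np.linalg.norm(P[i] - P[j])) # original vs. projected distance
The projected pairwise distances should agree with the original ones up to a relative error well within $\epsilon$.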
I used Chapter 2 in the book "Foundations of Data Science" by Avrim Blum, John Hopcroft, and Ravi Kannan as the foundation for this lecture. The precise statements of concentration bounds for Gaussians are somewhat different, since I adapted them to follow the material from previous lectures in this course.