A digest of the paper ``From Gauss to Kolmogorov: Localized measures of complexity for ellipses’’ by Yuting Wei, Billy Fang, and Martin Wainwright (EJS, 2020).
Much can be said of Gaussian and Rademacher widths. At a high level, the width of a set carries information about how complicated that set is and, in turn, how hard it is to estimate a parameter known only to lie in that set. Methods for bounding the width of a set have been studied extensively.
We will consider a localized notion of Gaussian width. At a high level, one may reason that the performance of a non-parametric estimation procedure should not be judged by its ability to distinguish the truth from remote alternatives – what really matters is its ability to distinguish the truth from nearby alternatives, which are more likely to be confused for the truth. If this intuition does lead to tighter estimation rates when local notions of width are used, then we are motivated to find easy means of estimating the local Gaussian width of a set about a point. A result of this kind is the subject of the paper by Yuting Wei, Billy Fang, and Martin Wainwright (Wei, Fang, and J. Wainwright 2020).
The basic idea is the following. There is some minimal dimension \(k_*\in\{1,2,\ldots,d\}\) for which \(d\)-dimensional data are approximated locally by a \(k_*\)-dimensional plane to \(\delta\)-accuracy. When the data are sufficiently regular in a sense made precise below, the local Gaussian width within a \(\delta\)-ball is \(\delta\sqrt{k_*}\) up to a universal constant. While the Gaussian width itself is hard to compute, \(k_*\) is fairly easy to find, so the result is both concise and useful.
This report is organized as follows. We state all of the definitions that we need to make rigorous statements of the above ideas in background. Then we state the results of the paper and their most immediate consequences in results. We prove the first theorem in proofs, and we sketch the proof of the second (omitting a full proof which occupies five pages in our paper of interest). Finally, we summarize the entire report with the central idea and its consequences once again in discussion.
Again, the Gaussian width of a set is some notion of its size or, roughly, the difficulty of an estimation problem over it.
Definition 1 (Gaussian width) We define the Gaussian width of a set \(S\subset\mbox{$\mathbb{R}$}^d\) as \[\begin{align*} \mbox{$\mathcal{G}$}(S) &:= \mbox{$\mathbb{E}$}\left[\underset{u\in S}{\sup} \langle u, w\rangle\right]. && \left(w\sim N(0, I_d)\right) \end{align*}\] [Section 2.1 in (Wei, Fang, and J. Wainwright 2020)]
Definition 2 (Localized Gaussian width) The localized Gaussian width of \(S\) at \(u^*\in S\) and scale \(\delta\) is simply \(\mbox{$\mathcal{G}$}\left((S-u^*)\cap\mbox{$\mathbb{B}$}(\delta)\right)\), where \(S-u^*\) is the translation of \(S\) so that \(u^*\) coincides with the origin and \(\mbox{$\mathbb{B}$}(\delta)\) is an \(\ell_2\) ball of radius \(\delta\). [Section 3.1 in (Wei, Fang, and J. Wainwright 2020)]
That is, it is the Gaussian width not of the whole set \(S\) but of the subset of \(S\) (recentered at \(u^*\)) consisting of those points within \(\ell_2\) distance \(\delta\) of \(u^*\).
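To make Definitions 1 and 2 concrete, here is a minimal Monte Carlo sketch (our own illustration, not from the paper) that estimates both widths for a finite point cloud standing in for \(S\); the supremum over \(S\) becomes a maximum over the rows of an array, and the helper names and toy data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_width(points, n_draws=2000, rng=rng):
    """Monte Carlo estimate of E[sup_{u in S} <u, w>] for a finite set S.

    `points` is an (n, d) array whose rows are the elements of S."""
    d = points.shape[1]
    w = rng.standard_normal((n_draws, d))        # w ~ N(0, I_d)
    return (w @ points.T).max(axis=1).mean()     # average of per-draw suprema

def localized_gaussian_width(points, u_star, delta, **kw):
    """Width of (S - u*) intersected with the l2 ball of radius delta."""
    shifted = points - u_star
    local = shifted[np.linalg.norm(shifted, axis=1) <= delta]
    return gaussian_width(local, **kw)

# toy example: points scattered along a very flat cloud embedded in R^10
d, n = 10, 5000
scales = np.array([1.0, 0.2] + [0.01] * (d - 2))
points = rng.standard_normal((n, d)) * scales
print(gaussian_width(points))
print(localized_gaussian_width(points, u_star=points[0], delta=0.5))
```

The global width is driven by the two large axes of the cloud, while the localized width at scale \(\delta=0.5\) only sees the portion of the cloud near `points[0]`.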
Definition 3 (Kolmogorov width) We define the Kolmogorov \(k\)-width of \(S\) as \[\begin{equation} W_k(S) := \underset{\Pi_k}{\min}\text{ }\underset{u\in S}{\max} \left\lVert u - \Pi_k u\right\rVert_2, \end{equation}\] where the \(\Pi_k\) are \(k\)-dimensional orthogonal linear projections. [Section 2.2 in (Wei, Fang, and J. Wainwright 2020)]
In words, the \(k\)-width of \(S\) is the maximum reconstruction error after projecting \(S\) onto the best-approximating \(k\)-dimensional subspace. When \(S\) is represented by a sample of observed points, the problem of estimating the best projection \(\Pi_k\) is closely related to computing the first \(k\) principal components of the data set. See Figure 1.
Figure 1: Intersect an ellipse \(S\) with a \(\delta\)-ball around \(u^*\in S\), then find the projection \(\Pi_k\) attaining the minimax reconstruction error over \(S\cap \mbox{$\mathbb{B}$}(\delta)\). That error is the local Kolmogorov \(k\)-width of \(S\) at \(u^*\).
Clearly if \(k=d\), the dimension of the ambient space, then there is only one projection to consider (the identity), which ensures that \(W_d(S)=0\). On the other hand, if \(k=0\), then once again there is only one projection to consider (projection onto the origin alone) and \(W_0(S)=\max_{\theta\in S}\left\lVert\theta\right\rVert_2\). Furthermore, we have that \[\begin{align*} W_{k-1}(S) &\geq W_{k}(S). && (k=1,\ldots,d) \end{align*}\] In words, the Kolmogorov \(k\)-width decreases as the subspaces with which we approximate \(S\) increase in dimension; this fact is tedious to prove but intuitive enough and useful later.
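For a finite point cloud, a practical surrogate for \(W_k\) can be computed directly from Definition 3. The sketch below (our own illustration; the helper name and toy data are arbitrary) uses the span of the top-\(k\) right singular vectors as the candidate \(\Pi_k\), in the spirit of the principal-components connection above. Since PCA minimizes the average rather than the maximum squared residual, the value returned is an upper bound on the true minimax width, and, because the PCA subspaces are nested, it is non-increasing in \(k\) just as \(W_k\) is.

```python
import numpy as np

def kolmogorov_width_pca(points, k):
    """Surrogate for the Kolmogorov k-width of a finite set of points.

    Uses the span of the top-k right singular vectors as Pi_k and returns
    the worst-case reconstruction error max_u ||u - Pi_k u||_2."""
    if k == 0:
        return np.linalg.norm(points, axis=1).max()
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    V_k = vt[:k].T                                 # (d, k) orthonormal basis
    residuals = points - points @ V_k @ V_k.T      # components outside span(V_k)
    return np.linalg.norm(residuals, axis=1).max()

rng = np.random.default_rng(0)
d, n = 10, 2000
# points concentrated near a 3-dimensional subspace of R^10
points = rng.standard_normal((n, d)) * np.array([1, 1, 1] + [0.05] * (d - 3))
print([round(kolmogorov_width_pca(points, k), 3) for k in range(d + 1)])
# the printed sequence is non-increasing in k, and the last entry (k = d) is 0
```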
Definition 4 (Critical dimension) Pick \(\delta>0\) and \(\eta\in(0, 0.1)\). The critical dimension of \(S\) for \(u^*\) is \[\begin{equation} k_*(u^*, \delta) := \min\left\{k\in\{1,2,\ldots,d\} : W_k\left((S-u^*)\cap\mbox{$\mathbb{B}$}((1-\eta)\delta)\right) \leq 0.9\delta\right\}. \end{equation}\] [Equation 8 in (Wei, Fang, and J. Wainwright 2020)]
As suggested in the introduction, this quantity can be considered the minimal dimension for which there exists a \(k_*\)-dimensional projection that approximates a neighborhood of \(S\) around \(u^*\) to \(0.9\delta\)-accuracy. When \(S\) is represented by a sample of observed points, \(k_*\) may be thought of as the minimum number of top principal components of the dataset needed so that the worst-case reconstruction error, in the \(\ell_2\) sense, is at most \(0.9\delta\).
Note that we ask for \(0.9\delta\)-accuracy rather than simply \(\delta\)-accuracy because the latter can only be expected in the case \(k_*=d\), an uninteresting one. Instead we are much more interested in a small dimension at which we attain a reasonable representation of our set. We could alternatively pick \(\eta\in(0,t)\) and pursue a \((1-t)\delta\)-accurate \(k_*\)-dimensional subspace for \(0<t<0.5\), say; toggling the parameter \(t\) in this way may lead to interestingly different results in practice, but the argument proceeds in the same way.
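Combining Definition 4 with the surrogate width above gives a simple recipe for approximating the critical dimension of a finite sample. The sketch below is our own illustration: the helper name, the choice \(\eta=0.05\), and the toy data are arbitrary, and the PCA subspace again stands in for the true minimizing projection.

```python
import numpy as np

def critical_dimension(points, u_star, delta, eta=0.05):
    """Approximate k_*(u_star, delta) for a finite set of points.

    Recenters the set at u_star, keeps the points inside the ball of radius
    (1 - eta) * delta, and returns the smallest k whose PCA surrogate of the
    Kolmogorov k-width drops to 0.9 * delta or below."""
    shifted = points - u_star
    local = shifted[np.linalg.norm(shifted, axis=1) <= (1 - eta) * delta]
    d = points.shape[1]
    _, _, vt = np.linalg.svd(local, full_matrices=False)
    for k in range(1, d + 1):
        V_k = vt[:min(k, vt.shape[0])].T           # (d, <= k) orthonormal basis
        residuals = local - local @ V_k @ V_k.T
        if np.linalg.norm(residuals, axis=1).max() <= 0.9 * delta:
            return k
    return d

rng = np.random.default_rng(0)
d, n = 10, 5000
points = rng.standard_normal((n, d)) * np.array([1.0, 0.3, 0.1] + [0.01] * (d - 3))
print(critical_dimension(points, u_star=points[0], delta=0.5))
```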
For the motivating examples of the paper, the set \(S\) is a finite-dimensional ellipse \(\mbox{$\mathcal{E}$}^*_{\mu}\), defined as the ball \(\mbox{$\mathbb{B}$}(\theta^*; 1, \left\lVert\cdot\right\rVert_{\mu})\) of radius \(1\) around \(\theta^*\) in the norm \(\left\lVert\cdot\right\rVert_{\mu}\), where \[\begin{align*} \left\lVert\theta\right\rVert_{\mu}^2 = \sum_{j=1}^d \frac{\theta_j^2}{\mu_j}. && (\mu_1\geq\mu_2\geq\cdots\geq\mu_d> 0) \end{align*}\] We may ask how the difficulty of reconstructing \(\mbox{$\mathcal{E}$}^*_{\mu}\) with the projection \(\Pi_k\) relates to the difficulty of reconstructing another ellipse \(\mbox{$\mathcal{E}$}^*_{\gamma}\) (which is similarly defined).
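For later experiments it helps to have the \(\mu\)-norm and a way of generating points of \(\mbox{$\mathcal{E}$}^*_{\mu}\) in code. The following sketch is our own illustration (the particular \(\mu\) values are arbitrary): it draws Gaussian directions and rescales them so that every sample satisfies \(\left\lVert\theta-\theta^*\right\rVert_{\mu}\leq 1\), which covers the ellipse without being uniform over it.

```python
import numpy as np

def mu_norm(theta, mu):
    """The mu-norm ||theta||_mu = sqrt(sum_j theta_j^2 / mu_j)."""
    return np.sqrt(np.sum(theta ** 2 / mu, axis=-1))

def sample_ellipse(theta_star, mu, n, rng):
    """Draw n points from the mu-ball of radius 1 centered at theta_star."""
    z = rng.standard_normal((n, len(mu)))                 # random directions
    radii = rng.uniform(0.0, 1.0, size=(n, 1))            # random mu-radii in [0, 1]
    return theta_star + radii * z / mu_norm(z, mu)[:, None]

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5, 0.1, 0.05, 0.01])                # mu_1 >= ... >= mu_d > 0
theta_star = np.zeros(5)
sample = sample_ellipse(theta_star, mu, 10_000, rng)
assert np.all(mu_norm(sample - theta_star, mu) <= 1 + 1e-9)
```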
Definition 5 Consider the set \[\begin{equation*} \Gamma(\delta, \Pi_k) := \left\{\gamma\in\mbox{$\mathbb{R}$}_{+}^d : \underset{\left\lVert\Delta\right\rVert_{\mu}\leq \delta}{\sup}\left\lVert\Delta-\Pi_k\Delta\right\rVert_{\gamma}^2\leq 1\right\} \end{equation*}\] of all points \(\gamma\) which furnish ellipses in which the reconstruction error \(\Delta-\Pi_k\Delta\) has \(\gamma\)-norm at most \(1\) for every admissible \(\Delta\).
It is in terms of this set that we state our ``regularity conditions’’ for our central claim below.
Definition 6 If the ellipse \(\mbox{$\mathcal{E}$}^*_{\mu}\) satisfies \[\begin{equation} \delta\sqrt{k_*(\theta^*, \delta)} \gtrsim \underset{\gamma\in\Gamma(\delta,\Pi_{k_*})}{\inf} \sqrt{\sum_{j=1}^d \gamma_j}, \end{equation}\] then we say that \(\mbox{$\mathcal{E}$}^*_{\mu}\) is regular at \(\theta^*\).
The premise of the paper is, again, as follows. If the ellipse \(\mbox{$\mathcal{E}$}^*_{\mu}\) is regular at \(\theta^*\), then its local Gaussian complexity satisfies \[\begin{equation} \mbox{$\mathcal{G}$}\left(\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\right) \simeq \delta\sqrt{k_*(\theta^*, \delta)}. \tag{1} \end{equation}\] To show Claim (1), we require an upper bound and a lower bound on the local Gaussian complexity. The upper bound is the subject of Theorem 1. The lower bound is the subject of Theorem 2.
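As an informal sanity check of Claim (1) (ours, not from the paper), one can discretize an ellipse around \(\theta^*=0\), estimate its local Gaussian width by Monte Carlo, and compare the estimate with \(\delta\sqrt{k_*}\), where \(k_*\) is computed via the PCA surrogate of the Kolmogorov width. All constants and \(\mu\) values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, eta, n, n_draws = 8, 0.4, 0.05, 20_000, 500
mu = np.array([1.0, 0.6, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001])

# discretize the ellipse {theta : sum_j theta_j^2 / mu_j <= 1} around theta* = 0
z = rng.standard_normal((n, d))
mu_norms = np.sqrt(np.sum(z ** 2 / mu, axis=1, keepdims=True))
theta = rng.uniform(size=(n, 1)) * z / mu_norms

# Monte Carlo estimate of the localized Gaussian width G(E ∩ B(delta))
local = theta[np.linalg.norm(theta, axis=1) <= delta]
w = rng.standard_normal((n_draws, d))
width_mc = (w @ local.T).max(axis=1).mean()

# critical dimension k_* via the PCA surrogate of the Kolmogorov width
inner = theta[np.linalg.norm(theta, axis=1) <= (1 - eta) * delta]
_, _, vt = np.linalg.svd(inner, full_matrices=False)
for k_star in range(1, d + 1):
    V = vt[:k_star].T
    if np.linalg.norm(inner - inner @ V @ V.T, axis=1).max() <= 0.9 * delta:
        break

print(width_mc, delta * np.sqrt(k_star))   # comparable up to a universal constant
```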
Toward a result of the form of Equation (1), the authors offer an upper bound and a lower bound on the local Gaussian complexity of the ellipse. These are stated as follows.
Theorem 1 (Upper bound) Given \(\delta>0\), a tuple \((k, \Pi_k)\), and some \(\theta^*\in\mbox{$\mathcal{E}$}\), \[\begin{equation*} \mbox{$\mathcal{G}$}\left(\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\right) \leq \delta\sqrt{k} + \underset{\gamma\in\Gamma(\delta,\Pi_k)}{\inf} \sqrt{\sum_{j=1}^d \gamma_j}. \end{equation*}\] [Theorem 1 in (Wei, Fang, and J. Wainwright 2020)]
The proof of this theorem is reconstructed in the proofs section.
Theorem 2 (Lower bound) There exist universal constants \(c_1, c_2 > 0\) such that for all \(\theta^*\in\mbox{$\mathcal{E}$}\), \[\begin{equation*} \mbox{$\mathcal{G}$}\left(\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\right) \geq c_1\delta\sqrt{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\sqrt{k_*(\theta^*, \delta)} \end{equation*}\] for all \(\delta\in\left(0,c_2\Phi^{-1}\left((\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^{-1}-1)^2\right)\right)\). [Theorem 2 in (Wei, Fang, and J. Wainwright 2020)]
The proof of this theorem is sketched in proofs. Equation (1) now follows formally by combining the two theorems. The lower bound in Claim (1) is immediate from Theorem 2, up to the factor \(\sqrt{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\), which is a fixed constant for any \(\theta^*\) in the interior of the ellipse. The upper bound can be drawn quickly from Theorem 1 by taking \(k=k_*(\theta^*,\delta)\) and noting that, when \(\mbox{$\mathcal{E}$}^*_{\mu}\) is regular at \(\theta^*\) in the sense of Definition 6, the second term \(\inf_{\gamma\in\Gamma(\delta,\Pi_{k_*})}\sqrt{\sum_{j=1}^d\gamma_j}\) is itself at most a constant multiple of \(\delta\sqrt{k_*(\theta^*,\delta)}\).
To complement the above result, a further theorem in the paper assures the reasonable scaling of the critical dimension with \(\delta\) and the ease with which it can be approximately computed.
Proof (Proof of upper bound). For \(\Delta\in\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\) and \(\Pi_k\) a \(k\)-dimensional orthogonal projection, write \(\Delta=\Pi_k\Delta + (\Delta-\Pi_k\Delta)\). Then \[\begin{align*} \mbox{$\mathcal{G}$}\left(\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\right) &= \mbox{$\mathbb{E}$}\left[\underset{\Delta\in \mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)}{\sup} \langle w,\Delta\rangle\right]\\ &\leq \underset{T_1}{\underbrace{\mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle w,\Pi_k\Delta\rangle\right]}} + \underset{T_2}{\underbrace{\mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle w,\Delta-\Pi_k\Delta\rangle\right]}}, \end{align*}\] leaving the task to bound the terms \(T_1\) and \(T_2\) separately.
(Bounding \(T_1\).) Write \[\begin{align*} \mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle w,\Pi_k\Delta\rangle\right] &= \mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle \Pi_k w,\Pi_k\Delta\rangle\right] && ((I-\Pi_k)w\perp \Pi_k\Delta)\\ &\leq \underset{\Delta}{\sup}\left\lVert\Pi_k\Delta\right\rVert_2\, \mbox{$\mathbb{E}$}\left\lVert\Pi_k w\right\rVert_2 && (\text{Cauchy--Schwarz})\\ &\leq \delta \sqrt{k}, \end{align*}\] using the fact that \(\left\lVert\Delta\right\rVert_2\leq \delta\) by definition and that \[\begin{align*} \mbox{$\mathbb{E}$}\left\lVert\Pi_k w\right\rVert_2 &\leq \sqrt{\sum_{i=1}^d \mbox{$\mathbb{E}$}(\Pi_k w)_i^2} && (\text{Jensen})\\ &= \sqrt{\operatorname{tr}(\Pi_k)}\\ &= \sqrt{k}, \end{align*}\] since \(\Pi_k w\) has covariance \(\Pi_k\), an orthogonal projection of rank \(k\); equivalently, \(\Pi_k w\) has the same distribution as a \(k\)-dimensional standard Gaussian random variable embedded in \(d\)-dimensional space and then rotated according to \(\Pi_k\).
(Bounding \(T_2\).) Let \(A(\gamma)=\text{diag}(\sqrt{\gamma_1}, \sqrt{\gamma_2}, \ldots,\sqrt{\gamma_d}).\) Then write \[\begin{align*} \mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle w,\Delta-\Pi_k\Delta\rangle\right] &= \mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle Aw,A^{-1}(\Delta-\Pi_k\Delta)\rangle\right]\\ &\leq \underset{\Delta}{\sup} \left\lVert A^{-1}(\Delta-\Pi_k\Delta)\right\rVert_2 \mbox{$\mathbb{E}$}\left\lVert Aw\right\rVert_2. && (\text{Cauchy--Schwarz}) \end{align*}\] Of the first factor, note that for all \(\gamma\in\Gamma(\delta,\Pi_k)\) we have \[\begin{equation*} \left\lVert A(\gamma)^{-1}(\Delta-\Pi_k\Delta)\right\rVert_2 = \left\lVert\Delta-\Pi_k\Delta\right\rVert_{\gamma}\leq 1 \end{equation*}\] for every admissible \(\Delta\), by Definition 5. Of the second factor note that \[\begin{align*} \mbox{$\mathbb{E}$}\left\lVert Aw\right\rVert_2 &\leq \sqrt{\sum_{i=1}^d \mbox{$\mathbb{E}$}(A w)_i^2} && (\text{Jensen})\\ &= \sqrt{\sum_{i=1}^d \gamma_i}. \end{align*}\] It follows that \[\begin{equation*} \mbox{$\mathbb{E}$}\left[\underset{\Delta}{\sup} \langle w,\Delta-\Pi_k\Delta\rangle\right] \leq \underset{\gamma\in\Gamma(\delta,\Pi_k)}{\inf}\sqrt{\sum_{i=1}^d \gamma_i} \end{equation*}\] since the left-hand side does not depend on \(\gamma\). Combining this result with the bound on \(T_1\) concludes the proof of Theorem 1.
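The Jensen/trace step in the bound on \(T_1\) is easy to check numerically. The snippet below (our own illustration) builds a random rank-\(k\) orthogonal projection and confirms that \(\mbox{$\mathbb{E}$}\left\lVert\Pi_k w\right\rVert_2\) sits just below \(\sqrt{k}\), as the proof uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_draws = 20, 5, 200_000

# a random rank-k orthogonal projection Pi_k, built from an orthonormal basis
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
Pi_k = Q @ Q.T

w = rng.standard_normal((n_draws, d))
mean_norm = np.linalg.norm(w @ Pi_k, axis=1).mean()
print(mean_norm, np.sqrt(k))   # E||Pi_k w||_2 is slightly below sqrt(k), as Jensen predicts
```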
Proof (Sketch of proof of lower bound). Consider two separate cases.
(Case 1.) Suppose that \(\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}\leq\frac12\). The first step toward the claim in this setting is to show that \[\begin{equation*} \mbox{$\mathcal{G}$}\left(\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\right) \geq \mbox{$\mathcal{G}$}\left(\mbox{$\mathbb{B}$}\left(\frac{3}{10}\delta\right)\cap E_{k'_*}\right), \end{equation*}\] where \(E_k\subset\mbox{$\mathbb{R}$}^d\) is the subspace spanned by \(\{e_1,\ldots,e_k\}\) and \(k'_*\geq k_*\). In words, if \(\theta^*\) is close to the center of the ellipse, then its associated local Gaussian complexity is bounded below by the local Gaussian complexity of a \(k'_*\)-dimensional ball. It can be checked directly that the right-hand side of the above inequality is of order \(\frac{3}{10}\delta\sqrt{k'_*}\), and by the assumption of this case we have that \[\begin{equation*} \frac{3}{10}\delta\sqrt{k'_*} \geq \frac{3}{10}\delta\sqrt{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\sqrt{k_*(\theta^*,\delta)}, \end{equation*}\] which is the result. What remains is to verify that the first inequality above indeed holds. Loosely speaking, it does because the set appearing in the right-hand complexity is contained in the set appearing in the left-hand complexity, and Gaussian width is monotone with respect to set inclusion.
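The computation used in Case 1 is also easy to check: the Gaussian width of the radius-\(r\) ball inside a \(k\)-dimensional coordinate subspace is \(r\,\mbox{$\mathbb{E}$}\left\lVert(w_1,\ldots,w_k)\right\rVert_2\), which is within a universal constant of \(r\sqrt{k}\). A quick numerical confirmation (our own illustration, with arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_draws = 20, 6, 0.3, 200_000

# sup over the radius-r ball in span{e_1, ..., e_k} of <u, w> is attained by
# aligning u with the first k coordinates of w, giving r * ||(w_1, ..., w_k)||_2
w = rng.standard_normal((n_draws, d))
width = r * np.linalg.norm(w[:, :k], axis=1).mean()
print(width, r * np.sqrt(k))   # close to r * sqrt(k), up to a constant below 1
```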
(Case 2.) Suppose that \(\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}\geq\frac12\). The gist of Lemmas 1 and 2 is that there exists a vector \(\theta^{\dagger}\in\mbox{$\mathcal{E}$}\) and a packing \[\begin{align*} \Xi &= \{\theta^{\dagger}+\sigma^S: S\in\mbox{$\mathcal{J}$}, \sigma^S \text{ $s$-sparse}\} \subset \mbox{$\mathcal{E}$},\text{ where}\\ \mbox{$\mathcal{J}$}&\subset \mbox{$\mathcal{P}$}(\{\lfloor(k_*-1)/8\rfloor,\ldots,\lfloor(k_*-1)/4\rfloor\})\text{ and}\\ s &:= \rho\frac{k_*-1}{16} &&(\rho\in(0,1)) \end{align*}\] such that \[\begin{align*} \lvert\Xi\rvert &\geq \binom{\lfloor\frac{k_*-1}{16}\rfloor}{s},\text{ and}\\ \delta^2 &\leq \left\lVert\theta^S - \theta^*\right\rVert_2^2 \leq \left(\frac{4}{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\right)\delta^2. \end{align*}\] All this is to obtain a (large but tractable) collection of vectors \(\Delta^S\in\mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)\) proportional to \(\Tilde{\theta}-\theta^*\) with \(\Tilde{\theta}\in\Xi\), so that \[\begin{equation} \mbox{$\mathbb{E}$}\left[\underset{\Delta\in \mbox{$\mathcal{E}$}^*_{\mu}\cap\mbox{$\mathbb{B}$}(\delta)}{\sup} \langle w,\Delta\rangle\right] \geq \mbox{$\mathbb{E}$}\left[\underset{\Tilde{\theta}\in\Xi}{\max} \langle w,\Delta^S\rangle\right]. \end{equation}\] A further Lemma 3 takes the conditions of Theorem 2 and combines them with the result of Lemma 2 to show that the right-hand side of the above inequality is bounded below by \(\delta\sqrt{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\sqrt{k_*}\) up to a universal constant. The argument for this step is somewhat similar to the bounding of \(T_1\) in the proof of Theorem 1 above, in that the sparsity of the \(\Delta^S\) zeroes out \(d-k_*\) entries of the Gaussian vector \(w\), so that, by Hölder, \(\mbox{$\mathbb{E}$}\left\lVert w\right\rVert_{\infty}\) looks like \(\sqrt{k_*}\) and \(\sup\left\lVert\Delta^S\right\rVert_1\) looks like \(\delta\sqrt{1-\left\lVert\theta^*\right\rVert_{\mbox{$\mathcal{E}$}}^2}\). This is, as one would expect, a gross oversimplification of the details of the full proof.
This result is quite similar to what we already know about the relation between global Gaussian width and least squares estimation rates on ellipses. This topic is discussed at length in (J. Wainwright 2019).
Can these results for bounding the local Gaussian width be imported to the question of bounding the global Gaussian width? Not immediately, but consider the following. The hardest vector \(\theta^*\in\mbox{$\mathcal{E}$}\) to estimate is the zero vector, since this is where the ellipse is the ``widest,'' and the critical dimension of the ellipse around \(0\) is naturally going to be higher than the critical dimension anywhere else. Although the localized width at the center, \(\mbox{$\mathcal{G}$}(\mbox{$\mathcal{E}$}_0\cap\mbox{$\mathbb{B}$}(\delta))\), is not exactly \(\mbox{$\mathcal{G}$}(\mbox{$\mathcal{E}$})\), the two help capture the minimax difficulty of the estimation problem over \(\mbox{$\mathcal{E}$}\).