\(\require{physics}\)
Eigenlearning

Jun 11, 2025

Some notes on the Eigenlearning paper by Simon et al.

Preliminaries

We define our dataset \(\mathcal{D} = \{x_i\}_{i=1}^n \overset{i.i.d.}{\sim} p\), where \(p\) is some distribution over \(\mathbb{R}^d\). As usual, our goal is to learn a function \(f\) given the noisy labels \(y \doteq \{f(x_i) + \eta_i\}_{i=1}^n\), where \(\eta_i \sim \mathcal{N}(0, \epsilon^2)\). We also equip scalar functions \(g, h\) with the inner product \(\langle g, h \rangle = \mathbb{E}_{x \sim p}[g(x)h(x)]\) and the induced norm \(\norm{g}^2 = \langle g, g \rangle\).

We study the kernel ridge regression (KRR) predictor \begin{equation} \hat{f}(x) =k_{x\mathcal{D}}(K_{\mathcal{DD}} + \delta I_n)^{-1}y, \end{equation} where \([k_{x\mathcal{D}}]_i = K(x, x_i)\), \([K_{\mathcal{DD}}]_{i,j} = K(x_i, x_j)\), and \(\delta \geq 0\) is the ridge parameter.
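As a concrete reference point, here is a minimal NumPy sketch of this predictor. The RBF kernel, the target function, and the hyperparameter values are illustrative assumptions of mine, not choices made in the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.5):
    # K(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T)
    return np.exp(-sq_dists / (2 * length_scale**2))

def krr_predict(X_train, y_train, X_test, kernel, delta=1e-3):
    # f_hat(x) = k_xD (K_DD + delta I_n)^{-1} y
    K_DD = kernel(X_train, X_train)
    k_xD = kernel(X_test, X_train)
    alpha = np.linalg.solve(K_DD + delta * np.eye(len(X_train)), y_train)
    return k_xD @ alpha

# Toy setup: noisy samples of a 1-d target on Unif[-1, 1] (illustrative choices).
rng = np.random.default_rng(0)
n, eps = 50, 0.1
X_train = rng.uniform(-1, 1, size=(n, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + eps * rng.standard_normal(n)
X_test = np.linspace(-1, 1, 200)[:, None]
f_hat = krr_predict(X_train, y_train, X_test, rbf_kernel, delta=1e-3)
```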

We wish to minimize the test MSE \begin{equation} \mathcal{E}^{(\mathcal{D})}(f) = \norm{f - \hat{f}}^2 + \epsilon^2 \quad \text{ and } \quad \mathcal{E}(f) = \mathbb{E}_{\mathcal{D}}[\mathcal{E}^{(\mathcal{D})}(f)]. \end{equation} We also define our bias and variance as \(\mathcal{B}(f) = \norm{f - \mathbb{E}_\mathcal{D}[\hat{f}]}^2 + \epsilon^2\) and \(\mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f)\). Finally, we define the train MSE in the usual way, measured against the noisy labels: \begin{equation} \mathcal{E}_{tr}(f) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{f}(x_i))^2. \end{equation}

From Mercer's theorem, any PSD kernel \(K\) (for a refresher on kernels, see this previous post) admits a spectral decomposition \begin{equation} K(x, x^\prime) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x^\prime), \end{equation} where \(\lambda_i \geq 0\) and \(\langle \phi_i, \phi_j \rangle = \delta_{ij}\). We assume that the \(\lambda_i\) are arranged in non-increasing order. From here, we can expand \begin{equation} f(x) = \sum_i \mathbf{v}_i \phi_i(x) \quad \text{ and } \quad \hat{f}(x) = \sum_i \hat{\mathbf{v}}_i \phi_i(x), \end{equation} where \(\mathbf{v}\) and \(\hat{\mathbf{v}}\) are the vectors of eigencoefficients, i.e. \(\mathbf{v}_i = \langle \phi_i, f \rangle\) and \(\hat{\mathbf{v}}_i = \langle \phi_i, \hat{f} \rangle\).
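The Mercer eigensystem rarely has a closed form, but it can be approximated numerically: eigendecompose the kernel matrix on a large sample from \(p\) and rescale (a Nyström-style Monte Carlo estimate). The sketch below reuses rbf_kernel from above; the sample size and distribution are again my own illustrative choices.

```python
import numpy as np

def empirical_eigensystem(kernel, X):
    """Monte Carlo estimate of the Mercer eigensystem of `kernel` under p,
    from M i.i.d. samples X ~ p (Nystrom-style approximation)."""
    M = len(X)
    K = kernel(X, X)
    # Eigenvalues of K / M approximate the Mercer eigenvalues lambda_i;
    # sqrt(M) times the eigenvectors approximate phi_i evaluated at the samples.
    evals, evecs = np.linalg.eigh(K / M)
    order = np.argsort(evals)[::-1]            # non-increasing order
    lambdas = np.clip(evals[order], 0.0, None)
    phis = np.sqrt(M) * evecs[:, order]        # column i ~ phi_i at the samples
    return lambdas, phis

# Sanity check: empirical orthonormality, <phi_i, phi_j> ~ delta_ij.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(2000, 1))
lambdas, phis = empirical_eigensystem(rbf_kernel, X)
gram = phis.T @ phis / len(X)                  # approximately the identity matrix
```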

Learnability

We define learnability as a measure somewhat analogous to test MSE (but linear in \(\hat{f}\)). For any function \(f\) such that \(\norm{f} = 1\), we define \begin{equation} \mathcal{L}^{(\mathcal{D})}(f) \doteq \langle f, \hat{f}\rangle \quad \text{ and }\quad \mathcal{L}(f) \doteq \mathbb{E}_\mathcal{D}\big[\mathcal{L}^{(\mathcal{D})}(f)\big], \end{equation} referred to as \(\mathcal{D}\)-learnability and learnability, respectively. This is effectively cosine similarity between \(f\) and \(\hat{f}\) (up to the normalization of \(\hat{f}\)).
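Since the inner product is an expectation over \(p\), \(\mathcal{D}\)-learnability can be estimated by Monte Carlo. A minimal sketch, reusing krr_predict, rbf_kernel, and the toy setup above (all names and values are illustrative, not from the paper):

```python
import numpy as np

def d_learnability(f, X_train, X_mc, kernel, delta=1e-3, noise_std=0.0, rng=None):
    """Monte Carlo estimate of L^(D)(f) = <f, f_hat> = E_{x~p}[f(x) f_hat(x)],
    assuming f is normalized so that E_{x~p}[f(x)^2] = 1."""
    rng = rng or np.random.default_rng(0)
    y = f(X_train) + noise_std * rng.standard_normal(len(X_train))
    f_hat = krr_predict(X_train, y, X_mc, kernel, delta=delta)
    return np.mean(f(X_mc) * f_hat)

# Usage: sqrt(2) * sin(2 pi x) has unit norm under Unif[-1, 1].
f = lambda X: np.sqrt(2) * np.sin(2 * np.pi * X[:, 0])
X_mc = rng.uniform(-1, 1, size=(20000, 1))     # Monte Carlo points drawn from p
L_D = d_learnability(f, X_train, X_mc, rbf_kernel, delta=1e-3, noise_std=eps, rng=rng)
```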

Properties

Learnability has several useful properties, listed below.

Properties of Learnability
The following hold for any \(f\) such that \(\norm{f} = 1\):
  1. \(\mathcal{L}(\phi_i), \mathcal{L}^\mathcal{D}(\phi_i) \in [0, 1]\).
  2. For \(n=0\), \(\mathcal{L}^{\mathcal{D}}(f) = \mathcal{L}(f) = 0\).
  3. For \(\mathcal{D}_{+} \doteq \mathcal{D} \cup \{x\}\), where \(x \in X, x \notin \mathcal{D}\), \(\mathcal{L}^{\mathcal{D}_{+}}(\phi_i) \geq \mathcal{L}^{\mathcal{D}}(\phi_i)\).
  4. \(\pdv{\lambda_i}\mathcal{L}^{\mathcal{D}}(\phi_i) \geq 0\), \(\pdv{\lambda_i}\mathcal{L}^{\mathcal{D}}(\phi_j) \leq 0\), and \(\pdv{\delta}\mathcal{L}^{\mathcal{D}}(\phi_i) \leq 0\).
  5. \(\mathcal{E}(f) \geq \mathcal{B}(f) \geq (1 - \mathcal{L}(f))^2\).

With the five properties above, we have a fairly intuitive picture of the learning process. From (1-3), learnability is bounded between 0 and 1 (inclusive) and increases monotonically from 0 as data points are added, reaching a maximum of 1 in the infinite-data limit (with ridge parameter \(\delta = 0\)). From (4), we have competition between eigenmodes: increasing a single eigenvalue while holding the others fixed can only decrease the learnabilities of the other eigenmodes, and increasing the ridge \(\delta\) can only decrease learnability. Property (5) provides a lower bound on the test MSE and bias in terms of learnability.

Conservation Law

We can now view ridgeless KRR (\(\delta = 0\)) as a projection of \(f\) onto the \(n\)-dimensional subspace of the RKHS (for a refresher on RKHSs, see this previous post) spanned by the representers \(K(\cdot, x_i)\) of the \(n\) samples. This viewpoint yields a conservation law.

Conservation of Learnability
For any complete basis of orthonormal functions \(\mathcal{F}\), when \(\delta = 0\), \begin{equation} \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) = \sum_{f \in \mathcal{F}} \mathcal{L}(f) = n, \end{equation} and when \(\delta > 0\), \begin{equation} \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) \leq n \quad \text{ and } \quad \sum_{f \in \mathcal{F}} \mathcal{L}(f) \leq n. \end{equation}

This is a fairly intuitive result: total learnability is at most the number of training examples, and at \(\delta = 0\) every kernel distributes exactly \(n\) units of learnability across any complete orthonormal basis. This is essentially a "no-free-lunch" statement: averaged over all functions in a basis, every kernel learns equally little, so a kernel can only favor some functions at the expense of others.

Proof of Conservation Law

Learning Transfer Matrix

From the spectral decomposition of the kernel, we may write \(K_{\mathcal{DD}} = {\Phi}^\top \Lambda {\Phi}\), where \({\Phi}\) is the design matrix of eigenfunctions evaluated at the data points (defined as \({\Phi}_{ij} \doteq \phi_i(x_j)\)) and \(\Lambda\) is the diagonal matrix of eigenvalues. Using orthonormality of the eigenfunctions and defining \([\boldsymbol{\phi}_i]_j \doteq \phi_i(x_j)\) (the \(i\)-th row of \(\Phi\)), the predicted eigencoefficients in the noiseless case are \begin{equation} \hat{\mathbf{v}}_i \doteq \langle \phi_i, \hat{f}\rangle = \lambda_i \boldsymbol{\phi}_i^\top (K_{\mathcal{DD}} + \delta I_n)^{-1} \Phi^\top \mathbf{v}. \end{equation} Stacking these rows, we have \begin{equation} \hat{\mathbf{v}} = T^{(\mathcal{D})} \mathbf{v}, \quad T^{(\mathcal{D})} \doteq \Lambda \Phi (K_{\mathcal{DD}} + \delta I_n)^{-1} \Phi^\top, \end{equation} where \(T^{(\mathcal{D})}\) is the learning transfer matrix that describes the model's learning behavior on the training set \(\mathcal{D}\).
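To make this concrete, here is a small numerical sketch on a finite domain, where the Mercer eigensystem is exact: take \(p\) uniform over \(M\) points, pick orthonormal eigenfunctions (a scaled orthogonal matrix), choose eigenvalues, and check that \(\hat{\mathbf{v}} = T^{(\mathcal{D})}\mathbf{v}\) holds for the KRR predictor. The domain size, spectrum, and ridge are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, delta = 30, 10, 1e-3       # domain size, train set size, ridge (illustrative)

# Exact eigensystem for p uniform over M points:
# Phi_full[i, a] = phi_i(a), with (1/M) * Phi_full @ Phi_full.T = I (orthonormality under p).
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
Phi_full = np.sqrt(M) * Q.T
lambdas = 1.0 / np.arange(1, M + 1) ** 2                  # non-increasing eigenvalues
K_full = Phi_full.T @ np.diag(lambdas) @ Phi_full         # K(a, b) on the whole domain

# Draw a training set D and a target f with eigencoefficients v.
D = rng.choice(M, size=n, replace=False)
v = rng.standard_normal(M)
Phi = Phi_full[:, D]                                      # Phi_{ij} = phi_i(x_j)
K_DD = K_full[np.ix_(D, D)]
y = Phi.T @ v                                             # noiseless labels f(x_j)

# KRR prediction on the whole domain, then project back onto the eigenbasis.
f_hat = K_full[:, D] @ np.linalg.solve(K_DD + delta * np.eye(n), y)
v_hat = Phi_full @ f_hat / M                              # v_hat_i = <phi_i, f_hat>

# Learning transfer matrix: T^(D) = Lambda Phi (K_DD + delta I_n)^{-1} Phi^T.
T_D = np.diag(lambdas) @ Phi @ np.linalg.solve(K_DD + delta * np.eye(n), Phi.T)
assert np.allclose(v_hat, T_D @ v)
```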

Clearly, \(\mathcal{L}^\mathcal{D}(\phi_i) = T^{(\mathcal{D})}_{ii}\) and \(\mathcal{L}(\phi_i) = T_{ii}\), where \(T \doteq \mathbb{E}_\mathcal{D}\big[T^{(\mathcal{D})}\big]\) (this is easily verified by setting \(\mathbf{v}\) to a canonical basis vector, i.e. zero in all indices except the \(i\)-th).

Having defined the learning transfer matrix, showing the conservation law is straightforward.

Proof

Note that at \(\delta = 0\) the learning transfer matrix is a projector onto the \(n\)-dimensional subspace of the RKHS defined by the \(n\) training points. For any orthonormal basis \(\mathcal{F}\) on \(X\), \begin{equation} \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) = \sum_{\mathbf{v} \in \mathcal{V}} \mathbf{v}^\top T^{(\mathcal{D})} \mathbf{v} = \operatorname{Tr}\big[T^{(\mathcal{D})}\big], \end{equation} where \(\mathcal{V}\) is the corresponding orthonormal basis of eigencoefficient vectors. By cyclicity of the trace, \begin{equation} \operatorname{Tr}\big[T^{(\mathcal{D})}\big] = \operatorname{Tr}\big[\Lambda \Phi (K_{\mathcal{DD}} + \delta I_n)^{-1} \Phi^\top\big] = \operatorname{Tr}\big[K_{\mathcal{DD}} (K_{\mathcal{DD}} + \delta I_n)^{-1}\big], \end{equation} which equals \(n\) for \(\delta = 0\) (when \(K_{\mathcal{DD}}\) has full rank) and is strictly less than \(n\) for \(\delta > 0\), as claimed. Taking the expectation over \(\mathcal{D}\) gives the corresponding statements for \(\sum_{f} \mathcal{L}(f)\).
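Continuing the finite-domain sketch above, a quick numerical check of the conservation law (purely illustrative):

```python
# Continuing the finite-domain sketch above: check the conservation law.
def transfer_matrix(lambdas, Phi, delta):
    # T^(D) = Lambda Phi (K_DD + delta I_n)^{-1} Phi^T, with K_DD = Phi^T Lambda Phi
    K_DD = Phi.T @ np.diag(lambdas) @ Phi
    reg = K_DD + delta * np.eye(Phi.shape[1])
    return np.diag(lambdas) @ Phi @ np.linalg.solve(reg, Phi.T)

T_ridgeless = transfer_matrix(lambdas, Phi, delta=0.0)
T_ridge = transfer_matrix(lambdas, Phi, delta=1e-2)
assert np.isclose(np.trace(T_ridgeless), n)    # total learnability = n at delta = 0
assert np.trace(T_ridge) < n                   # strictly less than n for delta > 0
```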

Eigenlearning Equations

Here we restate the eigenlearning equations and give some interpretation. I might add derivations later, but they seem to be fairly tedious and relatively straightforward.

Eigenlearning Equations
First define the effective regularization \(\kappa\) as the unique positive solution to \begin{equation} n = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa}. \end{equation} Then, we have
  • \(\mathcal{L}(\phi_i) = \mathcal{L}_i \doteq \frac{\lambda_i}{\lambda_i + \kappa}\)
  • \(\mathcal{E}_0 = n \frac{\partial \kappa}{\partial \delta} = \frac{n}{n - \sum_i \mathcal{L}_i^2}\)
  • \(\mathcal{E}(f) = \mathcal{E}_0 \left( \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 \right)\)
  • \(\mathcal{B}(f) = \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 = \frac{\mathcal{E}(f)}{\mathcal{E}_0}\)
  • \(\mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f) = \left(\frac{\mathcal{E}_0 - 1}{\mathcal{E}_0}\right) \mathcal{E}(f)\)
  • \(\mathcal{E}_{\text{tr}}(f) = \frac{\delta^2}{n^2 \kappa^2} \mathcal{E}(f)\)
  • \(\mathbb{E}[\hat{v}_i] = \mathcal{L}_i v_i\)
  • \(\mathrm{Cov}[\hat{v}_i, \hat{v}_j] = \frac{\mathcal{E}(f)}{n} \mathcal{L}_i^2 \delta_{ij}\)

Perhaps the most intuitive result is the omniscient risk estimator for the test MSE (the third equation above), which tells us that modes that are fully learned (with learnability 1) do not contribute to the test MSE. The overfitting coefficient \(\mathcal{E}_0\) (the second equation above) is the factor by which pure noise is overfit: when the target is pure noise, the test MSE is \(\mathcal{E}_0\) times the noise variance. In this interpretation, overfitting of noise is overconfidence on the part of the kernel that the target function lies in the span of the top \(n\) eigenmodes. Distributing learnability more evenly across eigenmodes (or giving part of the learnability budget to the ridge parameter \(\delta\)) mitigates this problem. The fourth and fifth equations above give the bias and variance purely in terms of \(\mathcal{E}_0\) and \(\mathcal{E}(f)\). Notably, the bias strictly decreases as \(n\) grows.
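A minimal sketch of turning these equations into numbers: solve the fixed-point equation for \(\kappa\) by bisection (the right-hand side is strictly decreasing in \(\kappa\), so the positive root is unique), then read off the learnabilities, \(\mathcal{E}_0\), and the omniscient risk estimate. The power-law spectrum and target coefficients below are my own illustrative choices, not values from the paper.

```python
import numpy as np

def solve_kappa(lambdas, n, delta, tol=1e-12):
    """Solve n = sum_i lambda_i / (lambda_i + kappa) + delta / kappa for kappa > 0
    by bisection; the right-hand side is strictly decreasing in kappa."""
    rhs = lambda kappa: np.sum(lambdas / (lambdas + kappa)) + delta / kappa
    lo, hi = 1e-15, 1e2 * (lambdas.max() + delta + 1.0)
    while rhs(hi) > n:                  # grow the bracket until rhs(hi) < n
        hi *= 10
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rhs(mid) > n else (lo, mid)
    return 0.5 * (lo + hi)

def eigenlearning(lambdas, v, n, delta, noise_var=0.0):
    kappa = solve_kappa(lambdas, n, delta)
    L = lambdas / (lambdas + kappa)                      # modewise learnabilities L_i
    E0 = n / (n - np.sum(L**2))                          # overfitting coefficient
    E = E0 * (np.sum((1 - L)**2 * v**2) + noise_var)     # omniscient risk estimate
    return {"kappa": kappa, "L": L, "E0": E0, "test_mse": E,
            "bias": E / E0, "variance": E * (E0 - 1) / E0,
            "train_mse": delta**2 / (n**2 * kappa**2) * E}

# Illustrative power-law spectrum and target coefficients (not from the paper).
M = 500
lambdas = 1.0 / np.arange(1, M + 1) ** 2
v = 1.0 / np.arange(1, M + 1)
results = eigenlearning(lambdas, v, n=50, delta=1e-3, noise_var=0.01)
```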

Finally, we note that each new unit of learnability is distributed amongst eigenmodes as \begin{equation} \dv{\mathcal{L}_i}{n} = n^{-1}\mathcal{E}_0 r_i = \frac{r_i}{\sum_j r_j + \delta / \kappa}. \end{equation}

Here, \(r_i \doteq \mathcal{L}_i(1 - \mathcal{L}_i)\) is the rate at which the \(i\)-th eigenmode is learned. This is maximized for \(\mathcal{L}_i = 1/2\); partially learned eigenmodes benefit most from more data.
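As a quick sanity check, one can compare a finite-difference estimate of \(\dv{\mathcal{L}_i}{n}\) (treating \(n\) as continuous in the \(\kappa\) fixed point) against the formula above, reusing solve_kappa and the illustrative spectrum from the previous sketch:

```python
# Finite-difference check of dL_i/dn against r_i / (sum_j r_j + delta / kappa).
def learnabilities(lambdas, n, delta):
    kappa = solve_kappa(lambdas, n, delta)
    return lambdas / (lambdas + kappa), kappa

n, delta, dn = 50, 1e-3, 1e-3
L, kappa = learnabilities(lambdas, n, delta)
L_plus, _ = learnabilities(lambdas, n + dn, delta)
r = L * (1 - L)
predicted = r / (np.sum(r) + delta / kappa)
finite_diff = (L_plus - L) / dn
max_err = np.max(np.abs(predicted - finite_diff))   # should be O(dn), i.e. small
```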

There's much more in the paper, but the above are the main takeaways. I'll likely come back later and add some more derivations, connections to the free Fermi gas, and some other cool results from the paper, but for now, this is good enough.