DR

Eigenlearning

Jun 11, 2025

Some notes on the Eigenlearning paper by Simon et al.

Preliminaries

We define our dataset D={xi}i=1ni.i.d.p\mathcal{D} = \{x_i\}_{i=1}^n \overset{i.i.d.}{\sim} p, where pp is some distribution over Rd\mathbb{R}^d. As usual, our goal is to learn a function ff given the (noisy) labels y{f(xi)+ηi}i=1ny \doteq \{f(x_i) + \eta_i\}_{i=1}^n, where ηiN(0,ϵ2)\eta_i \sim \mathcal{N}(0, \epsilon^2). We also induce the inner product and norm on scalar functions g,hg, h defined as g,h=Exp[g(x)h(x)]\langle g, h \rangle = \mathbb{E}_{x \sim p}[g(x)h(x)] and g2=g,g\left\| g \right\|^2 = \langle g, g \rangle.

We study only the function f^(x)=kxD(KDD+δIn)1y, \hat{f}(x) =k_{x\mathcal{D}}(K_{\mathcal{DD}} + \delta I_n)^{-1}y, where [kxD]i=K(x,xi)[k_{x\mathcal{D}}]_i = K(x, x_i) and [KDD]i,j=K(xi,xj)[K_{\mathcal{DD}}]_{i,j} = K(x_i, x_j).

We wish to minimize E(D)(f)=ff^2+ϵ2 and E(f)=ED[E(D)(f)]. \mathcal{E}^{(\mathcal{D})}(f) = \left\| f - \hat{f} \right\|^2 + \epsilon^2 \quad \text{ and } \quad \mathcal{E}(f) = \mathbb{E}_{\mathcal{D}}[\mathcal{E}^{(\mathcal{D})}(f)]. We also define our bias and variance as B(f)=fED[f^]2+ϵ2\mathcal{B}(f) = \left\| f - \mathbb{E}_\mathcal{D}[\hat{f}] \right\|^2 + \epsilon^2 and V(f)=E(f)B(f)\mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f). Finally, we may define our train MSE in the typical way: Etr(f)=1ni=1n(f(xi)f^(xi))2. \mathcal{E}_{tr}(f) = \frac{1}{n}\sum_{i=1}^n (f(x_i) - \hat{f}(x_i))^2.

From Mercer's Theorem, we have that any PSD kernel KK (for a refresher on kernels, see this previous post) admits a spectral decomposition K(x,x)=i=1λiϕi(x)ϕi(x), K(x, x^\prime) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(x^\prime), where λi0\lambda_i \geq 0, ϕi,ϕj=δij\langle \phi_i, \phi_j \rangle = \delta_{ij}. We assume that λi\lambda_i are arranged in non-increasing order. From here, we can take f(x)=iviϕi(x) and f^(x)=iv^iϕi(x), f(x) = \sum_i \mathbf{v}_i \phi_i(x) \quad \text{ and } \quad \hat{f}(x) = \sum_i \hat{\mathbf{v}}_i \phi_i(x), for (vi),(v^i)(\mathbf{v}_i), (\hat{\mathbf{v}}_i) the vectors of eigencoefficients.

Learnability

We define learnability as a measure somewhat analogous to test MSE (but linear). For any function ff so that f=1\left\| f \right\| = 1, we define L(D)(f)f,f^ and L(f)ED[L(D)(f)], \mathcal{L}^{(\mathcal{D})}(f) \doteq \langle f, \hat{f}\rangle \quad \text{ and }\quad \mathcal{L}(f) \doteq \mathbb{E}_\mathcal{D}\big[\mathcal{L}^{(\mathcal{D})}(f)\big], referred to as learnability and D\mathcal{D}-learnability, respectively. This is effectively just cosine similarity (up to normalization).

Properties

Learability has several useful properties, listed below.

Properties of Learnability
The following hold for any ff such that f=1\left\| f \right\| = 1:
  1. L(ϕi),LD(ϕi)[0,1]\mathcal{L}(\phi_i), \mathcal{L}^\mathcal{D}(\phi_i) \in [0, 1].
  2. For n=0n=0 LD(f)=L(f)=0\mathcal{L}^{\mathcal{D}}(f) = \mathcal{L}(f) = 0.
  3. For D+Dx\mathcal{D}_{+} \doteq \mathcal{D} \cup x, where xX,xDx \in X, x \notin \mathcal{D}, LD+(ϕi)LD(ϕi)\mathcal{L}^{\mathcal{D}_{+}}(\phi_i) \geq \mathcal{L}^{\mathcal{D}}(\phi_i).
  4. λiLD(ϕi)0\frac{\partial}{\partial \lambda_i}\mathcal{L}^{\mathcal{D}}(\phi_i) \geq 0, λiLD(ϕj)0\frac{\partial}{\partial \lambda_i}\mathcal{L}^{\mathcal{D}}(\phi_j) \leq 0, and δLD(ϕi)0\frac{\partial}{\partial \delta}\mathcal{L}^{\mathcal{D}}(\phi_i) \leq 0.
  5. E(f)B(f)(1L(f))2\mathcal{E}(f) \geq \mathcal{B}(f) \geq (1 - \mathcal{L}(f))^2.

With the 5 properties above, we have a very intuitive idea of the learning process. From (a-c), we have that learnability is bounded between 0 and 1 (inclusive) and that it increases monotonically from 0 as data points are added, reaching a maximum of 1 in the infinite data limit (where we have ridge parameter δ=0\delta = 0). From (d), we have competition of eigenmodes: increasing only a single eigenvalue while holding the others fixed only decreases the learnabilities of the other eigenmodes. (e) provides a lower bound on test MSE.

Conservation Law

Now, viewing KRR as a projection of ff onto the nn-dimensional subspace of the RKHS (for a refresher on RKHSs, see this previous post) defined by the nn samples.

Conservation of Learnability
For any complete basis of orthogonal functions F\mathcal{F}, when δ=0\delta = 0, fFLD(f)=fFL(f)=n, \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) = \sum_{f \in \mathcal{F}} \mathcal{L}(f) = n, and when δ>0\delta > 0, fFLD(f)n and fFL(f)n. \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) \leq n \quad \text{ and } \quad \sum_{f \in \mathcal{F}} \mathcal{L}(f) \leq n.

This is also a fairly intuitive result; total learnability is at most the number of training examples. And averaged over all functions, all kernels have the same learnability. This is more or less the "no-free-lunch" theorem: averaged over all functions, every model performs as random chance.

Proof of Conservation Law

Learning Transfer Matrix

From the spectral decomposition of the kernel matrix, we may decompose KDD=ΦΛΦK_{\mathcal{DD}} = {\Phi}^\top \Lambda {\Phi}, where Φ{\Phi} is the design matrix of eigenfunctions evaluated at the data points (defined as Φijϕi(xj){\Phi}_{ij} \doteq \phi_i(x_j)) and Λ\Lambda is the diagonal matrix of eigenvalues. Using orthonormality of eigenfunctions and defining [ϕi]jϕi(xj)[\phi_i]_j \doteq \phi_i(x_j), the predicted function coefficients v^\hat{\mathbf{v}} are v^iϕi,f^=λiϕi(KDD+δIn)1ϕiv. \hat{\mathbf{v}}_i \doteq \langle \phi_i, \hat{f}\rangle = \lambda_i \phi_i (K_{\mathcal{DD}} + \delta I_n)^{-1} \phi_i^\top \mathbf{v}. Then, we have v^=T(D)v, \hat{\mathbf{v}} = T^{(\mathcal{D})} \mathbf{v}, where T(D)T^{(\mathcal{D})} is the learning transfer matrix that describes the model's learning behavior on the training set D\mathcal{D}.

Clearly, LD(ϕi)=TiiD\mathcal{L}^\mathcal{D}(\phi_i) = T_{ii}^{\mathcal{D}} and L(ϕi)=Tii\mathcal{L}(\phi_i) = T_{ii} (this is easily verified by setting v\mathbf{v} as a cannonical basis vector, i.e. 0 in all indices except the ii-th).

Having defined the learning transfer matrix, showing the conservation law is straightforward.

Proof

Note that the learning transfer matrix is a projector onto the nn-dimensional subspace of the RKHS defined by the nn training points. Then, for any orthognal basis F\mathcal{F} on XX, fFLD(f)=vVvT(D)vvvTr[T(D)], \sum_{f \in \mathcal{F}} \mathcal{L}^{\mathcal{D}}(f) = \sum_{\mathbf{v} \in \mathcal{V}} \frac{\mathbf{v}^\top T^{(\mathcal{D})} \mathbf{v}}{\mathbf{v}^\top \mathbf{v}} \equiv \operatorname{Tr}\big[T^{(\mathcal{D})}\big], where V\mathcal{V} is an orthonormal basis of RM\mathbb{R}^M. Since T(D)ΦΛΦ(ΦΛΦ+δIn)1=K(K+δIn)1T^{(\mathcal{D})} \doteq \Phi^\top \Lambda \Phi (\Phi^\top \Lambda \Phi + \delta I_n)^{-1} = K(K + \delta I_n)^{-1}, the trace is nn for δ=0\delta = 0 and less than nn for δ>0\delta > 0 as claimed.

Eigenlearning Equations

Here we restate the eigenlearning equations and some interpretation. I might add derivations later but they seem to be pretty tedious and relatively straightforward.

Eigenlearning Equations
First define the effective regularization κ\kappa as the unique positive solution to n=iλiλi+κ+δκ. n = \sum_i \frac{\lambda_i}{\lambda_i + \kappa} + \frac{\delta}{\kappa}. Then, we have
  • L(ϕi)=Liλiλi+κ\mathcal{L}(\phi_i) = \mathcal{L}_i \doteq \frac{\lambda_i}{\lambda_i + \kappa}
  • E0=nκδ=nnii2\mathcal{E}_0 = n \frac{\partial \kappa}{\partial \delta} = \frac{n}{n - \sum_i \ell_i^2}
  • E(f)=E0(i(1Li)2vi2+ϵ2)\mathcal{E}(f) = \mathcal{E}_0 \left( \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 \right)
  • B(f)=i(1Li)2vi2+ϵ2=E(f)E0\mathcal{B}(f) = \sum_i (1 - \mathcal{L}_i)^2 v_i^2 + \epsilon^2 = \frac{\mathcal{E}(f)}{\mathcal{E}_0}
  • V(f)=E(f)B(f)=(E01E0)E(f)\mathcal{V}(f) = \mathcal{E}(f) - \mathcal{B}(f) = \left(\frac{\mathcal{E}_0 - 1}{\mathcal{E}_0}\right) \mathcal{E}(f)
  • Etr(f)=δ2n2κ2E(f)\mathcal{E}_{\text{tr}}(f) = \frac{\delta^2}{n^2 \kappa^2} \mathcal{E}(f)
  • E[v^i]=Livi\mathbb{E}[\hat{v}_i] = \mathcal{L}_i v_i
  • Cov[v^i,v^j]=E(f)nLi2δij\mathrm{Cov}[\hat{v}_i, \hat{v}_j] = \frac{\mathcal{E}(f)}{n} \mathcal{L}_i^2 \delta_{ij}

Perhaps the most intuitive result is the omniscient risk estimator for the test MSE (the third equation above) which tells us that modes that are fully learned (with learnability 1) do not contribute to the test MSE. The overfitting coefficient (the second equation above) gives us the MSE when trained on pure noise and is the factor by which pure noise is overfit. In this interpretation, ovefitting of noise is overconfidence on the part of the kernel that the target function lies in the top nn subspace. Distributing learnability amongst a wider distribution (or giving part of the learnability budget to the ridge parameter δ\delta) fixes this problem. The fourth and fifth equations above give us the bias and variance only in terms of E0\mathcal{E}_0 and E(f)\mathcal{E}(f). Notably, bias strictly decreases as nn grows.

Finally, we note that each new unit of learnability is distributed amongst eigenmodes as dLidn=n1E0ri=rijrj+δ/κ. \frac{d \mathcal{L_i}}{d n} = n^{-1}\mathcal{E}_0 r_i = \frac{r_i}{\sum_j r_j + \delta / \kappa}.

Here, riLi(1Li)r_i \doteq \mathcal{L}_i(1 - \mathcal{L}_i) is the rate at which the ii-th eigenmode is learned. This is maximized for Li=1/2\mathcal{L}_i = 1/2; partially learned eigenmodes benefit most from more data.

There's much more in the paper, but the above are the main takeaways. I'll likely come back later and add some more derivations, connections to the free Fermi gas, and some other cool results from the paper, but for now, this is good enough.