Some notes on the Eigenlearning paper by Simon et al.
Preliminaries
We define our dataset D={xi}i=1n∼i.i.d.p, where p is some distribution over Rd. As usual, our goal is to learn a function f given the (noisy) labels y≐{f(xi)+ηi}i=1n, where ηi∼N(0,ϵ2). We also induce the inner product and norm on scalar functions g,h defined as ⟨g,h⟩=Ex∼p[g(x)h(x)] and ∥g∥2=⟨g,g⟩.
We study only the function f^(x)=kxD(KDD+δIn)−1y, where [kxD]i=K(x,xi) and [KDD]i,j=K(xi,xj).
We wish to minimize E(D)(f)=f−f^2+ϵ2 and E(f)=ED[E(D)(f)]. We also define our bias and variance as B(f)=f−ED[f^]2+ϵ2 and V(f)=E(f)−B(f). Finally, we may define our train MSE in the typical way: Etr(f)=n1i=1∑n(f(xi)−f^(xi))2.
From Mercer's Theorem, we have that any PSD kernel K (for a refresher on kernels, see this previous post) admits a spectral decomposition K(x,x′)=i=1∑∞λiϕi(x)ϕi(x′), where λi≥0, ⟨ϕi,ϕj⟩=δij. We assume that λi are arranged in non-increasing order. From here, we can take f(x)=i∑viϕi(x) and f^(x)=i∑v^iϕi(x), for (vi),(v^i) the vectors of eigencoefficients.
Learnability
We define learnability as a measure somewhat analogous to test MSE (but linear). For any function f so that ∥f∥=1, we define L(D)(f)≐⟨f,f^⟩ and L(f)≐ED[L(D)(f)], referred to as learnability and D-learnability, respectively. This is effectively just cosine similarity (up to normalization).
Properties
Learability has several useful properties, listed below.
Properties of Learnability
The following hold for any f such that ∥f∥=1:
L(ϕi),LD(ϕi)∈[0,1].
For n=0LD(f)=L(f)=0.
For D+≐D∪x, where x∈X,x∈/D, LD+(ϕi)≥LD(ϕi).
∂λi∂LD(ϕi)≥0, ∂λi∂LD(ϕj)≤0, and ∂δ∂LD(ϕi)≤0.
E(f)≥B(f)≥(1−L(f))2.
With the 5 properties above, we have a very intuitive idea of the learning process. From (a-c), we have that learnability is bounded between 0 and 1 (inclusive) and that it increases monotonically from 0 as data points are added, reaching a maximum of 1 in the infinite data limit (where we have ridge parameter δ=0). From (d), we have competition of eigenmodes: increasing only a single eigenvalue while holding the others fixed only decreases the learnabilities of the other eigenmodes. (e) provides a lower bound on test MSE.
Conservation Law
Now, viewing KRR as a projection of f onto the n-dimensional subspace of the RKHS (for a refresher on RKHSs, see this previous post) defined by the n samples.
Conservation of Learnability
For any complete basis of orthogonal functions F, when δ=0,
f∈F∑LD(f)=f∈F∑L(f)=n,
and when δ>0,
f∈F∑LD(f)≤n and f∈F∑L(f)≤n.
This is also a fairly intuitive result; total learnability is at most the number of training examples. And averaged over all functions, all kernels have the same learnability. This is more or less the "no-free-lunch" theorem: averaged over all functions, every model performs as random chance.
Proof of Conservation Law
Learning Transfer Matrix
From the spectral decomposition of the kernel matrix, we may decompose KDD=Φ⊤ΛΦ, where Φ is the design matrix of eigenfunctions evaluated at the data points (defined as Φij≐ϕi(xj)) and Λ is the diagonal matrix of eigenvalues. Using orthonormality of eigenfunctions and defining [ϕi]j≐ϕi(xj), the predicted function coefficients v^ are v^i≐⟨ϕi,f^⟩=λiϕi(KDD+δIn)−1ϕi⊤v. Then, we have v^=T(D)v, where T(D) is the learning transfer matrix that describes the model's learning behavior on the training set D.
Clearly, LD(ϕi)=TiiD and L(ϕi)=Tii (this is easily verified by setting v as a cannonical basis vector, i.e. 0 in all indices except the i-th).
Having defined the learning transfer matrix, showing the conservation law is straightforward.
Proof
Note that the learning transfer matrix is a projector onto the n-dimensional subspace of the RKHS defined by the n training points. Then, for any orthognal basis F on X, f∈F∑LD(f)=v∈V∑v⊤vv⊤T(D)v≡Tr[T(D)], where V is an orthonormal basis of RM. Since T(D)≐Φ⊤ΛΦ(Φ⊤ΛΦ+δIn)−1=K(K+δIn)−1, the trace is n for δ=0 and less than n for δ>0 as claimed.
Eigenlearning Equations
Here we restate the eigenlearning equations and some interpretation. I might add derivations later but they seem to be pretty tedious and relatively straightforward.
Eigenlearning Equations
First define the effective regularization κ as the unique positive solution to
n=i∑λi+κλi+κδ.
Then, we have
L(ϕi)=Li≐λi+κλi
E0=n∂δ∂κ=n−∑iℓi2n
E(f)=E0(∑i(1−Li)2vi2+ϵ2)
B(f)=∑i(1−Li)2vi2+ϵ2=E0E(f)
V(f)=E(f)−B(f)=(E0E0−1)E(f)
Etr(f)=n2κ2δ2E(f)
E[v^i]=Livi
Cov[v^i,v^j]=nE(f)Li2δij
Perhaps the most intuitive result is the omniscient risk estimator for the test MSE (the third equation above) which tells us that modes that are fully learned (with learnability 1) do not contribute to the test MSE. The overfitting coefficient (the second equation above) gives us the MSE when trained on pure noise and is the factor by which pure noise is overfit. In this interpretation, ovefitting of noise is overconfidence on the part of the kernel that the target function lies in the top n subspace. Distributing learnability amongst a wider distribution (or giving part of the learnability budget to the ridge parameter δ) fixes this problem. The fourth and fifth equations above give us the bias and variance only in terms of E0 and E(f). Notably, bias strictly decreases as n grows.
Finally, we note that each new unit of learnability is distributed amongst eigenmodes as dndLi=n−1E0ri=∑jrj+δ/κri.
Here, ri≐Li(1−Li) is the rate at which the i-th eigenmode is learned. This is maximized for Li=1/2; partially learned eigenmodes benefit most from more data.
There's much more in the paper, but the above are the main takeaways. I'll likely come back later and add some more derivations, connections to the free Fermi gas, and some other cool results from the paper, but for now, this is good enough.