
Normal Gradients of Kernel Interpolants

Mar 06, 2026

Why models can overfit on noisy data and still generalize optimally

Motivation

A few months ago I was reading this paper, which distinguishes between benign, tempered, and catastrophic overfitting in kernel interpolants1. The classical picture of benign overfitting2 is a bit mysterious: the model memorizes noise in the training labels, but somehow achieves Bayes-optimal error on the test set. A natural question: if the model has memorized noise, why does that not affect its generalization performance -- in some sense, where is that noise "stored"?

The answer turns out to be geometric: the noise is pushed into the normal directions of the data manifold. A kernel interpolant trained on noisy labels looks smooth along the manifold (so it generalizes), but oscillates wildly in directions perpendicular to it. This gives us a clean separation between two often conflated ideas: generalization, which is governed by the tangential gradient, and robustness, which is governed by the normal gradient.

Setup

Assume data $x_1, \dots, x_n \sim \mu$ are drawn from a probability measure supported on a compact $d$-dimensional submanifold $\mathcal{M} \subset \mathbb{R}^D$, with $D \gg d$. This is the manifold hypothesis: data lives near a low-dimensional structure embedded in high-dimensional space.

We observe noisy labels

$$ y \doteq f^*(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), $$

and fit a positive-definite kernel $k$ via the minimum-norm interpolant3 (ridgeless KRR):

$$ \hat{f}(x) = k(x)^\top K^{-1} y, $$

where $k(x) = (k(x, x_1), \dots, k(x, x_n))^\top$ and $K_{ij} = k(x_i, x_j)$. This fits every training point exactly -- noise and all.
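A minimal NumPy sketch of this fit, using the Laplace kernel from later in the post as a concrete choice of $k$; the function names and the small jitter are my own additions, the jitter being there only for numerical stability:

```python
import numpy as np

def laplace_kernel(X, Z, ell=1.0):
    """k(x, x') = exp(-||x - x'|| / ell); any positive-definite kernel would do."""
    dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-dists / ell)

def fit_min_norm_interpolant(X_train, y_train, ell=1.0, jitter=1e-10):
    """Ridgeless KRR: alpha = K^{-1} y (the tiny jitter is only for numerical stability)."""
    K = laplace_kernel(X_train, X_train, ell)
    alpha = np.linalg.solve(K + jitter * np.eye(len(y_train)), y_train)
    f_hat = lambda X: laplace_kernel(X, X_train, ell) @ alpha   # f_hat(x) = k(x)^T K^{-1} y
    return f_hat, alpha
```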

At any $x \in \mathcal{M}$, the ambient space splits orthogonally4:

$$ \mathbb{R}^D = T_x\mathcal{M} \oplus N_x\mathcal{M}. $$

Let $P_T(x)$ and $P_\perp(x)$ denote the projectors onto the tangent and normal bundles respectively. The gradient of $\hat{f}$ splits accordingly:

$$ \nabla \hat{f}(x) = \underbrace{P_T(x)\,\nabla \hat{f}(x)}_{\nabla_T \hat{f}(x)} + \underbrace{P_\perp(x)\,\nabla \hat{f}(x)}_{\nabla_\perp \hat{f}(x)}. $$
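When the manifold comes with a local parametrization, both projectors can be computed from the Jacobian of the embedding. A small sketch, with a circle in $\mathbb{R}^3$ as a toy example (names are illustrative):

```python
import numpy as np

def projectors_from_embedding(jacobian):
    """Given the D x d Jacobian of a local parametrization of M at a point,
    return the tangential and normal projectors P_T(x), P_perp(x), each D x D."""
    U, _, _ = np.linalg.svd(jacobian, full_matrices=False)  # orthonormal basis of T_x M
    P_T = U @ U.T
    P_perp = np.eye(jacobian.shape[0]) - P_T
    return P_T, P_perp

# toy example: the unit circle (d = 1) embedded in R^3 via phi(t) = (cos t, sin t, 0)
t = 0.7
jac = np.array([[-np.sin(t)], [np.cos(t)], [0.0]])
P_T, P_perp = projectors_from_embedding(jac)
```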

Normal Gradient Energy

Our central object is the Normal Gradient Energy:

$$ \Gamma_N \doteq \mathbb{E}_{x \sim \mu}\left[\|\nabla_\perp \hat{f}(x)\|^2\right]. $$

This measures how much the interpolant varies in the normal directions, averaged over the data distribution.

To compute it, introduce the normal gradient feature matrix at each $x$:

$$ J(x) \in \mathbb{R}^{(D-d) \times n}, \qquad J(x)_{\colon, i} \doteq P_\perp(x)\,\nabla_x k(x, x_i), $$

and the Gradient Kernel Matrix:

$$ G \doteq \mathbb{E}_{x \sim \mu}\!\left[J(x)^\top J(x)\right], \qquad G_{ij} = \mathbb{E}_{x \sim \mu}\!\left[\nabla_\perp k(x, x_i) \cdot \nabla_\perp k(x, x_j)\right]. $$

$G$ is SPSD and depends only on the kernel and the manifold geometry, not the labels.

Since $\hat{f}(x) = k(x)^\top \alpha$ with $\alpha = K^{-1} y$, we can write $\nabla_\perp \hat{f}(x) = J(x)\alpha$, so that

$$ \Gamma_N = \mathbb{E}_{x \sim \mu}\!\left[\|J(x)\alpha\|^2\right] = \alpha^\top G \alpha. $$
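Here is one way $\Gamma_N = \alpha^\top G \alpha$ could be estimated by Monte Carlo over $x \sim \mu$, assuming the Laplace kernel and a user-supplied normal projector; using the full $D \times D$ projector instead of a basis of the normal space changes $J(x)$'s shape but not $\|J(x)\alpha\|$:

```python
import numpy as np

def laplace_grad(x, Xtr, ell=1.0):
    """Rows are grad_x k(x, x_i) for the Laplace kernel k(x, x') = exp(-||x - x'|| / ell)."""
    diff = x[None, :] - Xtr                                   # (n, D)
    r = np.linalg.norm(diff, axis=1, keepdims=True)
    return -np.exp(-r / ell) / ell * diff / np.maximum(r, 1e-12)

def normal_gradient_energy(alpha, Xtr, P_perp, X_mc, ell=1.0):
    """Monte Carlo estimate of Gamma_N = E_x ||J(x) alpha||^2 = alpha^T G alpha.
    P_perp(x) returns the D x D projector onto N_x M; X_mc is a sample from mu."""
    total = 0.0
    for x in X_mc:
        J = P_perp(x) @ laplace_grad(x, Xtr, ell).T           # columns: P_perp grad_x k(x, x_i)
        total += np.sum((J @ alpha) ** 2)
    return total / len(X_mc)
```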

Taking expectation over the label noise with $f^* \equiv 0$ (pure noise labels, the worst case for overfitting):

Trace Formula for Expected Normal Gradient Energy
For the ridgeless minimum-norm interpolant with pure noise labels $y = \epsilon \sim \mathcal{N}(0, \sigma^2 I)$,

$$ \mathbb{E}_\epsilon[\Gamma_N] = \sigma^2\,\operatorname{Tr}\!\left[K^{-1} G K^{-1}\right]. $$

The proof is immediate: substituting $\alpha = K^{-1}\epsilon$ into $\Gamma_N = \alpha^\top G \alpha$ gives $\epsilon^\top K^{-1} G K^{-1} \epsilon$, and the Gaussian quadratic form identity $\mathbb{E}[\epsilon^\top A \epsilon] = \sigma^2 \operatorname{Tr}(A)$ yields the result.
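A quick numerical sanity check of the trace formula against a Monte Carlo average over noise draws, assuming $K$ and $G$ have already been formed (names are illustrative):

```python
import numpy as np

def trace_formula_check(K, G, sigma=0.5, n_trials=5000, seed=0):
    """Compare sigma^2 Tr(K^{-1} G K^{-1}) with the Monte Carlo average of
    eps^T K^{-1} G K^{-1} eps over pure-noise labels eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    Kinv = np.linalg.inv(K)
    A = Kinv @ G @ Kinv
    analytic = sigma ** 2 * np.trace(A)
    eps = rng.normal(scale=sigma, size=(n_trials, K.shape[0]))
    empirical = np.einsum("ti,ij,tj->t", eps, A, eps).mean()
    return analytic, empirical
```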

Phase Transition

If $G$ and $K$ are simultaneously diagonalizable5 with eigenvalues $\{\gamma_j\}$ and $\{\lambda_j\}$ in a shared basis, the trace formula reduces to:

$$ \mathbb{E}_\epsilon[\Gamma_N] = \sigma^2 \sum_{j=1}^n \frac{\gamma_j}{\lambda_j^2}. $$

For translation-invariant kernels on homogeneous manifolds, we pass to a continuum approximation indexed by spatial frequencies $q \in \mathbb{R}^d$:

$$ \lambda(q) \approx \widehat{k}(q), \qquad \gamma(q) \approx \|P_\perp q\|^2\,\widehat{k}(q). $$

The normal gradient contributes an extra factor of $q^2$ (from differentiation) relative to the kernel eigenvalues. For the Laplace kernel6 $k(x, x') = \exp(-\|x - x'\|/\ell)$, whose spectrum satisfies $\widehat{k}(q) \asymp (1 + \ell^2 q^2)^{-(d+1)/2}$, the normal gradient energy becomes:

$$ \Gamma_N \asymp \int_{q_{\min}}^{q_{\max}} q^{d+1}\,(1 + \ell^2 q^2)^{-2}\,dq, $$

where $q_{\min} \sim n^{-1/d}$ and $q_{\max} \sim n^{1/d}$ are the data-supported frequency extremes.7

At high frequencies, the integrand behaves as $q^{d-3}$, so the integral stays bounded as $q_{\max} \to \infty$ only when $d - 3 < -1$, i.e., $d < 2$; at $d = 2$ it diverges logarithmically, and for $d > 2$ polynomially. Evaluating:

Normal Gradient Energy Scaling (Laplace Kernel)
$$ \Gamma_N \asymp \begin{cases} O(1), & d < 2, \\ O(\log n), & d = 2, \\ O(n^{(d-2)/d}), & d > 2. \end{cases} $$

The critical dimension is $d = 2$. Below it, normal gradient energy stays bounded as $n \to \infty$. Above it, it grows polynomially in sample size.
[Figure: empirical scaling of $\Gamma_N$ vs. $n$ across manifold dimensions -- bounded for $d < 2$, logarithmic at $d = 2$, polynomial for $d > 2$.]

Intuitively, as $n$ increases, the accessible frequency range grows (more data resolves finer structure). The normal gradient operator amplifies high frequencies by $q^2$. When $d > 2$, the Laplace kernel's spectral decay is not fast enough to counteract this, and the normal gradient energy diverges.
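The three regimes are easy to see by evaluating the integral above directly; a small sketch with all constants set to one, since only the scaling in $n$ matters:

```python
import numpy as np

def gamma_n_scaling(n, d, ell=1.0):
    """Evaluate int_{q_min}^{q_max} q^(d+1) (1 + ell^2 q^2)^(-2) dq on a log-spaced grid,
    with q_min ~ n^(-1/d) and q_max ~ n^(1/d)."""
    q = np.logspace(-np.log10(n) / d, np.log10(n) / d, 4000)
    integrand = q ** (d + 1) / (1.0 + (ell * q) ** 2) ** 2
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(q)))

# d = 1 stays bounded, d = 2 grows like log(n), d = 3 grows like n^(1/3)
for d in (1, 2, 3):
    print(d, [round(gamma_n_scaling(n, d), 2) for n in (10**2, 10**4, 10**6)])
```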

Where the Noise Goes

This gives a precise answer to the original question. When the interpolant fits noisy labels:

- along the manifold it stays smooth, so the tangential behavior (and hence generalization) is unaffected;
- perpendicular to the manifold it oscillates, and the noise energy shows up as normal gradient energy, which grows with $n$ for $d > 2$.

Concretely: the function values $\hat{f}(x)$ remain $O(1)$, the RKHS norm stays finite, and the interpolant fits the data exactly -- yet the normal gradient blows up. The noise is not "gone"; it is pushed into the geometry perpendicular to the data manifold.

A useful way to see this involves the inter-point spacing $\delta_n \sim n^{-1/d}$ and the screening length8 $\xi_n \sim n^{-1/(d+1)}$ of the equivalent kernel. Since

$$ \frac{\xi_n}{\delta_n} \asymp \frac{n^{-1/(d+1)}}{n^{-1/d}} = n^{1/d - 1/(d+1)} = n^{1/(d(d+1))} \longrightarrow \infty, $$

the equivalent kernel actually gets wider relative to the inter-point spacing as $n \to \infty$. The interpolant is not building sharper spikes around each training point -- it is becoming more spread out. The divergence in $\Gamma_N$ is therefore driven by amplitude growth in the normal direction, not spike sharpening in the tangential one.

Adversarial Robustness

The divergence of $\Gamma_N$ has a direct consequence for adversarial robustness. For a point $x \in \mathcal{M}$ and an off-manifold perturbation $\delta \in N_x\mathcal{M}$, a first-order Taylor expansion gives:

$$ \hat{f}(x + \delta) \approx \hat{f}(x) + \nabla_\perp \hat{f}(x) \cdot \delta. $$

The minimal-norm perturbation that flips the prediction is:

$$ \delta^*(x) = -\frac{\hat{f}(x)}{\|\nabla_\perp \hat{f}(x)\|^2}\,\nabla_\perp \hat{f}(x), $$

with magnitude (the adversarial margin):

$$ R_\perp(x) \approx \frac{|\hat{f}(x)|}{\|\nabla_\perp \hat{f}(x)\|}. $$
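A first-order sketch of the attack and the margin, assuming access to $\hat{f}$ and its projected gradient at $x$ (names are illustrative):

```python
import numpy as np

def adversarial_margin(x, f_hat, grad_perp_f):
    """First-order minimal off-manifold perturbation delta* and margin R_perp at x.
    f_hat(x) is the interpolant value; grad_perp_f(x) returns P_perp(x) grad f_hat(x)."""
    fx = f_hat(x)
    g = grad_perp_f(x)
    g_norm_sq = float(np.dot(g, g))
    delta_star = -fx / g_norm_sq * g           # flips the sign of the linearized prediction
    margin = abs(fx) / np.sqrt(g_norm_sq)      # R_perp(x) = |f_hat(x)| / ||grad_perp f_hat(x)||
    return delta_star, margin
```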

Since $\hat{f}(x) = O(1)$ while $\|\nabla_\perp \hat{f}(x)\| \asymp \Gamma_N^{1/2}$, the margin scales as $\Gamma_N^{-1/2}$:

Adversarial Margin Scaling
For the minimum-norm Laplace kernel interpolant on a $d$-dimensional manifold,

$$ \mathbb{E}[R_\perp(x)] \;\asymp\; \begin{cases} O(1), & d < 2, \\ O((\log n)^{-1/2}), & d = 2, \\ O\!\left(n^{-(d-2)/(2d)}\right), & d > 2. \end{cases} $$

For $d > 2$, the adversarial margin vanishes polynomially in sample size.

Interestingly, more data makes the model less robust (for $d > 2$) -- adding more data increases the spectral bandwidth of the interpolant, amplifies the normal gradients, and shrinks the adversarial margin.

Back to Benign Overfitting

This is complementary to the classical benign overfitting theory: for the Laplace kernel, the test error remains controlled as $n$ grows, so memorizing the label noise does not degrade generalization.

Yet robustness collapses as $n$ increases for $d > 2$. This is not a contradiction -- it reflects the fact that generalization and robustness are fundamentally different geometric properties:

$$ \text{Generalization} \equiv \text{tangential gradient}, \qquad \text{Robustness} \equiv \text{normal gradient}. $$

Phase Diagram

Define the geometric instability ratio

$$ \rho^2 \doteq \frac{\Gamma_N}{\Gamma_T}, $$

where $\Gamma_T = \mathbb{E}_{x \sim \mu}[\|\nabla_T \hat{f}(x)\|^2]$ is the tangential gradient energy (a sketch for estimating $\rho^2$ numerically follows below). Since the tangential gradient energy stays bounded -- the interpolant remains smooth along the manifold -- $\rho^2$ inherits the scaling of $\Gamma_N$ as $n \to \infty$:

- $d < 2$: $\rho^2 = O(1)$ -- benign;
- $d = 2$: $\rho^2 = O(\log n)$ -- tempered;
- $d > 2$: $\rho^2 = O(n^{(d-2)/d})$ -- catastrophic.

(Here we have borrowed Mallinar et al.'s terminology.)
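For completeness, here is how $\rho^2$ could be estimated by Monte Carlo, given the interpolant's ambient gradient and the two projectors -- a sketch under those assumptions, not a prescription:

```python
import numpy as np

def instability_ratio(grad_f, P_T, P_perp, X_mc):
    """Monte Carlo estimate of rho^2 = Gamma_N / Gamma_T.
    grad_f(x) returns the ambient gradient of f_hat at x;
    P_T(x), P_perp(x) return the tangential and normal projectors."""
    num = den = 0.0
    for x in X_mc:
        g = grad_f(x)
        num += np.sum((P_perp(x) @ g) ** 2)   # normal gradient energy
        den += np.sum((P_T(x) @ g) ** 2)      # tangential gradient energy
    return num / den
```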

[Figure: geometric phase diagram across intrinsic dimension -- benign for $d < 2$, tempered at $d = 2$, catastrophic for $d > 2$.]

The same critical dimension $d = 2$ governs the divergence of $\Gamma_N$, the spectral accumulation of high-frequency modes, and the vanishing of the adversarial margin $R_\perp$. This is not a coincidence -- all three are manifestations of the same spectral phase transition.

TL;DR

The noise goes off-manifold. When a minimum-norm interpolant fits noisy labels, it preserves smoothness along the data manifold (enabling generalization) at the cost of encoding all the noise energy in the normal gradient (destroying robustness). In high intrinsic dimensions, this tradeoff is unavoidable: the geometry of kernel spaces forces noise into normal directions, and those directions grow increasingly unstable with more data.

Overfitting is geometrically benign only when $d < 2$, which is quite restrictive in practice. For data on manifolds of moderate intrinsic dimension, interpolation will generalize, but its adversarial robustness will degrade with sample size -- adding more data only makes it worse.

  1. Mallinar et al. distinguish three overfitting regimes for kernel interpolants: benign (memorizes noise but generalizes well), tempered (some degradation, but test error is bounded away from the trivial predictor), and catastrophic (test error is as bad as a trivial predictor). Their taxonomy is based on the scaling of bias and variance with sample size.
  2. This is also treated in a final project I did for Ben Recht's statistical learning theory course, which was largely motivated by this paper.
  3. The minimum-norm interpolant is the limit of kernel ridge regression as the regularization parameter $\lambda \to 0$. It is the unique function in the RKHS $\mathcal{H}_k$ with minimum RKHS norm that interpolates the training data.
  4. This orthogonal splitting is a standard property of embedded submanifolds in differential geometry.
  5. Simultaneous diagonalizability of $G$ and $K$ is guaranteed for translation-invariant kernels on homogeneous manifolds, where both matrices commute in the Fourier basis. The spectral decomposition then cleanly separates across frequency modes.
  6. In fact, the argument makes no special use of the Laplace kernel beyond its spectral structure, so the same logic should extend to any Matérn kernel (the Laplace kernel is the Matérn kernel with $\nu = 1/2$).
  7. The limits of the integral are actually borrowed from the ultraviolet and infrared cutoffs in physics; Itzykson and Drouffe's Statistical Field Theory covers this in far more detail -- just something I thought was worth noting.
  8. The screening length $\xi_n \sim n^{-1/(d+1)}$ comes from the self-consistency equation for the Laplace RKHS equivalent kernel: $(-\Delta + \mu^2)^{(d+1)/2} h(x, x_i) = \delta(x - x_i)$, where the screening parameter satisfies $\mu^{d+1} \asymp \rho = n/V$, with $V$ the volume of the manifold.