Every Model is Kernel Ridge Regression
Sep 13, 2025
Some notes on Domingos' paper: Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
Background
Over the past year or so, I've been working a lot with kernel ridge regression. It provides a lot of neat theoretical tools and insights (see e.g. this post), but a common criticism of studying something like KRR is that it's rarely used anymore -- nearly all modern ML is deep learning, and KRR is not deep learning. Domingos' paper shows that every model learned by gradient descent is approximately a kernel machine1, allowing much of the theoretical analysis we have for KRR to be ported over to deep learning. In what follows, we'll first study the "richness scale" of neural networks and define the kernel regime (for more on the NTK, see this post), then go through Domingos' paper and reproduce the main proof.
The Richness Scale
Training a neural network can be broken down into two complementary processes: inference and backpropagation. In the inference stage, we get predictions from our model on some data; then in the backprop stage, we update our model's weights to reduce its loss (some metric of how "wrong" our model's prediction was), which in turn moves its hidden representations. Via some nice math2, one can show that the behavior of this training process is governed by a single effective hyperparameter: the size \(\norm{\Delta h}\) of the hidden-representation update induced by each gradient descent step.
This step size governs where on a so-called "richness" scale a model lives. At one end, we have the lazy regime (NTK). Here, updates are infinitesimal and hidden representations barely move: the network behaves like a fixed feature map, and training reduces to kernel ridge regression with the neural tangent kernel (NTK). All of the model's expressive power comes from its initialization, not from any representation learning. At the other end, we have the rich regime (\(\mu\)P); here, updates are large enough that hidden features evolve significantly during training, so the network learns new representations and breaks away from the fixed NTK. A good statistical mechanics analogy is to think of the effective update size as a control parameter for our system: in the lazy regime the model behaves almost like a fixed lattice, while in the rich regime it is less rigid, more like a fluid.
Formally, we index the richness scale by how the step size compares to width. For a layer of width \(n\), if the effective learning rate scales like \(O(1/n)\), then activations barely move (lazy), but if it scales like \(O(1)\), representations evolve substantially (rich). If we take the lazy regime to the infinite-width limit, we recover the NTK, where our model is exactly a kernel machine. In the following sections, we show how even at intermediate richness or in the \(\mu\)P regime, we may approximate a model as a sequence of effective kernels.
Preliminaries
Kernel Machines
Before beginning the main result, it's a good idea to formally define kernel machines. Taken from Domingos, a kernel machine is a model of the form \begin{equation} \hat{y} = g\left(\sum_i a_i K(x, x_i) + b \right), \end{equation} where \(g\) is some nonlinearity, \((x_i)\) are training data points, \(x\) is our query data point, \(K\) is our kernel, which measures the similarity of its inputs, and \((a_i), b\) are learned parameters. Kernels can be predefined or learned, but the main idea is that they give us some comparison or notion of similarity between data points. From the above form, it is clear that we may view a kernel machine as a single-layer neural network, with the kernel as the nonlinearity.
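To make this concrete, here is a minimal Python sketch of a kernel machine prediction. Everything here -- the Gaussian (RBF) kernel choice, the function names, the toy data, and the coefficients -- is made up for illustration; in an algorithm like KRR, the \((a_i)\) and \(b\) would instead be fit to training data.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def kernel_machine(x, train_xs, a, b, g=lambda z: z):
    """y_hat = g( sum_i a_i K(x, x_i) + b ), here with g the identity."""
    return g(sum(a_i * rbf_kernel(x, x_i) for a_i, x_i in zip(a, train_xs)) + b)

# Toy usage with made-up coefficients.
train_xs = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
a = [0.5, -0.2, 0.8]
b = 0.1
print(kernel_machine(np.array([0.5]), train_xs, a, b))
```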
Gradient Descent
It's also useful to recap gradient descent before moving on: gradient descent is an iterative algorithm for minimizing a differentiable loss function, and is the workhorse for training deep neural networks. We define a loss function \(\mathcal{L}(y_i, \hat{y}_i)\) which tells us how wrong our model's prediction \(\hat{y}_i\) is given the true answer \(y_i\). Given an initial parameter vector \(w_0\), gradient descent computes the loss as a function of the model's parameters (since \(\hat{y}_i\) is a function of the parameters \(w_s\) at any given step \(s\), we may write \(\mathcal{L}(w_s)\)) and iteratively updates the parameter vector \(w\) with step size \(\eta\) until we (approximately) reach a point where the gradient vanishes. Written out, we have \begin{equation} w_{s+1} = w_s - \eta \nabla_w \mathcal{L}(w_s). \end{equation}
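As a quick illustration, here is a hedged sketch of this update rule on a toy one-parameter least-squares problem; the data, learning rate, and step count are arbitrary choices for illustration.

```python
import numpy as np

# Toy problem: fit a slope w to data under the squared loss
# L(w) = 0.5 * sum_i (w * x_i - y_i)^2, whose gradient is available in closed form.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])

def grad_L(w):
    return np.sum((w * xs - ys) * xs)

w, eta = 0.0, 0.01
for _ in range(1000):
    w = w - eta * grad_L(w)   # w_{s+1} = w_s - eta * grad L(w_s)

print(w)   # close to the least-squares slope, ~0.99
```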
Path Kernels
The result in the paper shows that the kernel machine resulting from gradient descent uses the path kernel, defined as \begin{equation} K(x, x^\prime) = \int_{c(t)} \nabla_w y(x) \cdot \nabla_w y(x^\prime) \ dt. \end{equation}
Here, \(c(t)\) is the path taken by the parameters during gradient descent, so the full path kernel \(K\) is simply the integral, over that path, of the dot product of the model's gradients at the two points. Intuitively, it measures how similarly the model's output at the two data points varies over the course of learning.
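Since gradient descent actually takes discrete steps rather than tracing a continuous path, a natural (if informal) discretization is to approximate the integral by a sum over training steps \(s\), identifying an increment of flow time with the learning rate \(\eta\): \begin{equation} K(x, x^\prime) \approx \eta \sum_{s} \nabla_w y(x) \cdot \nabla_w y(x^\prime) \Big|_{w = w_s}. \end{equation} This is the discretization used in the numerical sketches below.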
Derivation
Let us first define the tangent kernel: the tangent kernel associated with a function \(f_w(x)\) and parameter vector \(v\) is \begin{equation} K_{f,v}^g (x, x^\prime) = \nabla_w f_w(x) \cdot \nabla_w f_w(x^\prime), \end{equation} with the gradients taken at \(v\). Then, the path kernel associated with the function \(f_w(x)\) and curve \(c(t)\) is \begin{equation} K^p_{f,c}(x, x^\prime) = \int_{c(t)} K^g_{f,w(t)}(x, x^\prime) \ dt. \end{equation}
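Here is a small Python sketch of the tangent kernel for a toy three-parameter model, with the gradients approximated by central finite differences; the model, parameter values, and inputs are arbitrary choices for illustration.

```python
import numpy as np

def grad_wrt_w(f, w, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f_w(x) w.r.t. w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        dw = np.zeros_like(w)
        dw[j] = eps
        g[j] = (f(w + dw, x) - f(w - dw, x)) / (2 * eps)
    return g

def tangent_kernel(f, w, x, xp):
    """K^g_{f,w}(x, x') = grad_w f_w(x) . grad_w f_w(x'), evaluated at w."""
    return grad_wrt_w(f, w, x) @ grad_wrt_w(f, w, xp)

# Toy model with three parameters: f_w(x) = w0 * tanh(w1 * x) + w2.
f = lambda w, x: w[0] * np.tanh(w[1] * x) + w[2]
w = np.array([0.5, -0.7, 0.1])
print(tangent_kernel(f, w, 0.8, -1.0))
```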
Theorem
Let us consider a model \(\hat{y} = f_w(x)\), where \(f\) is a differentiable function of \(w\), that is learned via gradient descent from data \(\{(x_i, y_i)\}_{i=1}^m\) with a differentiable loss function \(\mathcal{L} = \sum_i \mathcal{L}(y_i, \hat{y}_i)\) and learning rate \(\eta\). Then, \begin{equation} \lim_{\eta \to 0} \hat{y} = \sum_{i=1}^m a_i K(x, x_i) + b, \end{equation} where \(K\) is the path kernel associated with \(f_w(x)\) and the path taken by the parameters during gradient descent, \(a_i\) is the average of \(-\pdv{\mathcal{L}}{\hat{y}_i}\) along the path, weighted by the corresponding tangent kernel, and \(b\) is the initial model's prediction.
Proof
In the infinitesimal update limit (\(\eta \to 0\)), we recover the gradient flow ODE \begin{equation} \dv{w(t)}{t} = - \nabla_w \mathcal{L}(w(t)). \end{equation}
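To see this correspondence numerically, here is a small sketch comparing gradient descent at several step sizes against the exact gradient flow solution for a toy quadratic loss (the loss, flow time, and step sizes are arbitrary illustrative choices): as \(\eta \to 0\), the discrete iterates approach the ODE solution.

```python
# Gradient descent as an Euler discretization of the gradient flow ODE:
# for fixed "flow time" T = eta * n_steps, shrinking eta converges to the
# same trajectory. Toy quadratic loss: L(w) = 0.5 * (w - 3)^2.
def grad_L(w):
    return w - 3.0

def run_gd(eta, T=2.0, w0=0.0):
    w = w0
    for _ in range(round(T / eta)):
        w = w - eta * grad_L(w)
    return w

for eta in [0.5, 0.1, 0.01, 0.001]:
    print(eta, run_gd(eta))   # approaches the exact flow value 3 * (1 - e^{-2}) ~ 2.594
```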
Then, for the model's prediction \(y = f_{w(t)}(x)\) at any fixed input \(x\) (or indeed any differentiable function of the weights), the chain rule gives \begin{equation} \dv{y}{t} = \sum_{j=1}^d \pdv{y}{w_j} \dv{w_j}{t} = \sum_{j=1}^d \pdv{y}{w_j} \left(-\pdv{\mathcal{L}}{w_j}\right), \end{equation} where the second equality follows from the gradient flow ODE. Since the loss function is a sum over the \(m\) training points, applying the chain rule once more lets us rewrite this as \begin{equation} \dv{y}{t} = \sum_{j=1}^d \pdv{y}{w_j} \left(- \sum_{i=1}^m \pdv{\mathcal{L}}{\hat{y}_i} \pdv{\hat{y}_i}{w_j}\right), \end{equation} and rearranging terms gives \begin{equation} \dv{y}{t} = - \sum_{i=1}^m \pdv{\mathcal{L}}{\hat{y}_i} \sum_{j=1}^d \pdv{y}{w_j} \pdv{\hat{y}_i}{w_j}. \end{equation}
Defining \(\mathcal{L}^\prime (y_i, \hat{y}_i) \doteq \pdv{\mathcal{L}}{\hat{y}_i}\) as the loss derivative with respect to the \(i\)th prediction and invoking the definition of the tangent kernel, we have \begin{equation} \dv{y}{t} = - \sum_{i=1}^m \mathcal{L}^\prime (y_i, \hat{y}_i) K^{g}_{f, w(t)}(x, x_i). \end{equation}
Let \(y_0\) denote the initial model's prediction, prior to gradient descent. Then, integrating the previous expression along the gradient flow path, the final prediction \(y\) satisfies \begin{equation} \lim_{\eta \to 0} y = y_0 - \int_{c(t)} \sum_{i=1}^m \mathcal{L}^\prime(y_i, \hat{y}_i) K^{g}_{f, w(t)}(x, x_i) \ dt. \end{equation}
If we multiply and divide by \(\int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \ dt\), we have \begin{equation} \lim_{\eta \to 0} y = y_0 - \sum_{i=1}^m \left( \frac{\int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \mathcal{L}^\prime(y_i, \hat{y}_i) \ dt}{\int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \ dt}\right) \int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \ dt. \end{equation}
Defining the average loss derivative weighted by similarity to \(x\) as \(\bar{\mathcal{L}}^\prime(y_i, \hat{y}_i) = \int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \mathcal{L}^\prime(y_i, \hat{y}_i) \ dt / \int_{c(t)} K^{g}_{f, w(t)}(x, x_i) \ dt\) and substituting the definition of the path kernel yields \begin{equation} \lim_{\eta \to 0} y = y_0 - \sum_{i=1}^m \bar{\mathcal{L}}^\prime(y_i, \hat{y}_i) K^p_{f,c}(x, x_i) \end{equation} as claimed. Note that this has exactly the form of a kernel machine (with \(g\) the identity), with \(K(x, x_i) \doteq K^p_{f,c}(x, x_i)\), \(a_i \doteq -\bar{\mathcal{L}}^\prime(y_i, \hat{y}_i)\), and \(b \doteq y_0\).
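As a sanity check, here is a hedged numerical sketch of the theorem: we train a toy three-parameter model with a small learning rate (so we are near the gradient flow limit), accumulate the discretized path kernel and kernel-weighted loss derivatives along the way, and compare the resulting kernel machine's prediction against the trained model's prediction at a query point. The model, data, function names, and hyperparameters are all made up for illustration; the two printed numbers should agree closely, with the residual shrinking as \(\eta\) does.

```python
import numpy as np

# Toy model: f_w(x) = a * tanh(b * x) + c, with parameters w = (a, b, c).
def predict(w, x):
    a, b, c = w
    return a * np.tanh(b * x) + c

def grad_f(w, x):
    """Gradient of f_w(x) with respect to w = (a, b, c)."""
    a, b, c = w
    t = np.tanh(b * x)
    return np.array([t, a * x * (1.0 - t ** 2), 1.0])

# Tiny training set and a held-out query point (arbitrary numbers).
xs = np.array([-1.0, -0.3, 0.5, 1.2])
ys = np.array([-0.8, -0.2, 0.4, 0.9])
x_query = 0.8

w = np.array([0.5, -0.7, 0.1])   # initial parameters
eta, n_steps = 1e-3, 20000       # small eta approximates gradient flow

y0_query = predict(w, x_query)   # the kernel-machine bias: the initial model's prediction

# Accumulators for the discretized path kernel and the kernel-weighted
# loss derivatives; both integrals are approximated by eta-weighted sums.
path_kernel = np.zeros(len(xs))       # ~ K^p(x, x_i)
weighted_lprime = np.zeros(len(xs))   # ~ integral of K^g(x, x_i) * L'_i dt

for _ in range(n_steps):
    preds = np.array([predict(w, xi) for xi in xs])
    lprime = preds - ys                               # dL/dy_hat_i for L = 0.5 * sum (y_hat_i - y_i)^2
    grads = np.array([grad_f(w, xi) for xi in xs])    # shape (m, 3)
    g_query = grad_f(w, x_query)

    k_step = grads @ g_query                          # tangent kernel K^g(x, x_i) at this step
    path_kernel += eta * k_step
    weighted_lprime += eta * k_step * lprime

    w = w - eta * (grads.T @ lprime)                  # gradient descent step on the summed loss

# Kernel-machine form: y ~ sum_i a_i K^p(x, x_i) + b.
a_i = -weighted_lprime / path_kernel                  # -(path-averaged, kernel-weighted) loss derivative
print("trained model :", predict(w, x_query))
print("kernel machine:", a_i @ path_kernel + y0_query)
```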
Interpretation
The result above is simple and its proof straightforward, but it has many important implications. First and most importantly, it justifies the use of kernel methods in the analysis of deep learning models. It tells us that regardless of where models lie on the richness scale, they behave (at least approximately) as kernel machines: described by a fixed neural tangent kernel in the lazy regime, or by an evolving path kernel in the intermediate and rich regimes. More broadly, the result tells us that there is a deep theoretical connection between kernel machines and deep learning: all deep learning models trained by gradient descent are, in effect, continuously interpolating between different kernel machines.
References
- The lazy (NTK) and rich (\(\mu\)P) regimes: a gentle tutorial
- Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
- I use the term KRR somewhat broadly; in reality, KRR is a specific algorithm within a family of kernel methods, and this family is referred to as kernel machines. ↩
- The "nice math" is shown in Sec. 2.2 of arXiv:2404.19719, linked in references. (This tutorial goes far more in depth and is a really good reference on the richness scale.) ↩