
Reproducing Kernel Hilbert Spaces

Jun 03, 2025

Building up RKHSs from the Riesz Representation Theorem

Motivation

Reproducing Kernel Hilbert Spaces (RKHSs) are a big deal in machine learning, especially in the study of kernel methods and SVMs. An RKHS is a Hilbert space of functions with extra structure that makes pointwise evaluation interact nicely with the inner product, and this structure yields several useful results in statistical learning theory. We'll first prove the Riesz Representation Theorem, then use it to prove the representer theorem for RKHSs, and finally show that every kernel has an RKHS.

The Riesz Representation Theorem (for Hilbert Spaces)

Statement

Let $\mathcal{H}$ be a Hilbert space. Then for every continuous linear functional $f \in \mathcal{H}^*$ (the dual space of $\mathcal{H}$), there exists a unique element $v \in \mathcal{H}$ such that $f(u) = \langle u, v \rangle$ for all $u \in \mathcal{H}$.

Parsing this statement is a bit intimidating without some background in linear algebra or functional analysis. Essentially, the RRT states that any function $f$ that takes a vector $u \in \mathcal{H}$ and returns a scalar in a way that is linear and continuous is expressible as the inner product of that vector with some fixed vector $v \in \mathcal{H}$, and this vector $v$ is unique and thus completely determines the linear functional $f$.
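A finite-dimensional example makes this concrete. On $\mathbb{R}^n$ with the standard inner product, every linear functional has the form $f(u) = a_1 u_1 + \dots + a_n u_n$ for some fixed coefficients $a_1, \dots, a_n$, so the representing vector is simply $v = (a_1, \dots, a_n)$:
$$f(u) = \sum_{i=1}^n a_i u_i = \langle u, v \rangle.$$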

Proof

The proof of the RRT is pretty standard and can be found in most functional analysis textbooks (a finite-dimensional version appears in Axler's Linear Algebra Done Right), but it is included here for completeness.

Let $\mathcal{H}$ be a Hilbert space over $\mathbb{R}$ or $\mathbb{C}$, and let $f$ be a continuous linear functional on $\mathcal{H}$. We wish to show that there exists a unique $v \in \mathcal{H}$ such that $f(u) = \langle u, v \rangle$ for all $u \in \mathcal{H}$.

First, let's consider the trivial case $f = 0$. In this case, choosing $v = \vec{0}$ satisfies our requirement and we are done. So, going forward, we assume $f \neq 0$.

Let $\ker f = \{u \in \mathcal{H} \mid f(u) = 0\}$. This is a closed subspace of $\mathcal{H}$ because $f$ is continuous. Since $f \neq 0$, we have $\ker f \neq \mathcal{H}$, so there exists $w \in \mathcal{H} \setminus \ker f$.

Let $v = \operatorname{proj}_{(\ker f)^\perp}(w)$, the orthogonal projection of $w$ onto the orthogonal complement of $\ker f$. Then $v \in (\ker f)^\perp$ and $v \neq 0$ since $w \notin \ker f$. Note also that $f(v) = f(w) \neq 0$, because $w - v = \operatorname{proj}_{\ker f}(w) \in \ker f$. Define
$$v_0 = \frac{\overline{f(v)}}{\|v\|^2}v.$$

Any $x \in \mathcal{H}$ can be written as $x = u + \alpha v$ with $u \in \ker f$: take $\alpha = f(x)/f(v)$, so that $u = x - \alpha v$ satisfies $f(u) = f(x) - \alpha f(v) = 0$. By linearity of $f$, we have
$$f(x) = f(u + \alpha v) = f(u) + \alpha f(v) = \alpha f(v),$$
and, since $\langle u, v \rangle = 0$,
$$\langle x, v_0 \rangle = \left\langle u + \alpha v, \frac{\overline{f(v)}}{\|v\|^2}v \right\rangle = \frac{f(v)}{\|v\|^2}\langle u, v \rangle + \alpha \frac{f(v)}{\|v\|^2} \langle v, v \rangle = \alpha f(v).$$
Here the conjugate disappears because scalars are conjugated when pulled out of the second argument of the inner product (in the real case nothing changes at all). So $f(x)$ and $\langle x, v_0 \rangle$ are equal.

Thus, $f(x) = \langle x, v_0 \rangle$ for all $x \in \mathcal{H}$.

To show uniqueness, suppose there exists another vector $v_1 \in \mathcal{H}$ such that $f(x) = \langle x, v_1 \rangle$ for all $x \in \mathcal{H}$. Then,
$$\langle x, v_0 - v_1 \rangle = \langle x, v_0 \rangle - \langle x, v_1 \rangle = f(x) - f(x) = 0$$
for all $x \in \mathcal{H}$. Setting $x = v_0 - v_1$ gives
$$\|v_0 - v_1\|^2 = \langle v_0 - v_1, v_0 - v_1 \rangle = 0,$$
which implies $v_0 = v_1$. Thus, the vector $v_0$ is unique.
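As a quick sanity check (this is not part of the proof, and the variable names are just illustrative), we can verify the finite-dimensional case numerically: given only black-box access to a linear functional on $\mathbb{R}^n$, evaluating it on the standard basis recovers the representing vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A continuous linear functional on R^n, built here from a hidden vector a.
a = rng.normal(size=n)
f = lambda u: a @ u

# Recover the representing vector v via v_i = f(e_i); the RRT guarantees
# f(u) = <u, v> for every u.
v = np.array([f(e) for e in np.eye(n)])

u = rng.normal(size=n)
assert np.isclose(f(u), u @ v)
```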

Reproducing Kernel Hilbert Spaces

Kernels

Before diving into RKHSs, we must first define a kernel. A kernel acts as a similarity measure between data points. More precisely, a kernel is a function $K : X \times X \to \mathbb{R}$ (or $\mathbb{C}$) such that, for any pair of inputs $x, x^\prime \in X$, the value $K(x, x^\prime)$ gives some notion of similarity between $x$ and $x^\prime$.
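Two standard examples on $X = \mathbb{R}^d$ are the polynomial and Gaussian (RBF) kernels:
$$K(x, x^\prime) = (x^\top x^\prime + c)^p, \qquad K(x, x^\prime) = \exp\!\left(-\frac{\|x - x^\prime\|^2}{2\sigma^2}\right),$$
where $c \geq 0$, $p \in \mathbb{N}$, and $\sigma > 0$ are hyperparameters.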

A kernel must satisfy the following two properties:

1. Symmetry: $K(x, x^\prime) = K(x^\prime, x)$ for all $x, x^\prime \in X$ (or $K(x, x^\prime) = \overline{K(x^\prime, x)}$ in the complex case).
2. Positive semi-definiteness: for any finite set of points $x_1, \dots, x_n \in X$ and scalars $c_1, \dots, c_n$, we have $\sum_{i,j=1}^n c_i c_j K(x_i, x_j) \geq 0$.

Kernels can be thought of as implicitly defining a feature space $\mathcal{F}$ and a mapping $\phi : X \to \mathcal{F}$ s.t.
$$K(x, x^\prime) = \langle \phi(x), \phi(x^\prime) \rangle_{\mathcal{F}}.$$
Kernels are useful since we don't need to know $\phi$ explicitly (often referred to as the kernel trick). Via kernels, we may compute inner products in high- or infinite-dimensional feature spaces without ever leaving our original input space.
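As a small sketch of the kernel trick (the degree-2 polynomial kernel is just a convenient example here, and the function names are my own), we can check numerically that $K(x, x^\prime) = (x^\top x^\prime)^2$ equals an explicit inner product of features $\phi(x) = \operatorname{vec}(x x^\top)$, and that a Gram matrix built from it is positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)

def poly2_kernel(x, xp):
    """Homogeneous degree-2 polynomial kernel K(x, x') = (x . x')^2."""
    return np.dot(x, xp) ** 2

def poly2_features(x):
    """Explicit feature map phi(x) = outer(x, x) flattened, so that
    <phi(x), phi(x')> = (x . x')^2 recovers the kernel."""
    return np.outer(x, x).ravel()

x, xp = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(poly2_kernel(x, xp), poly2_features(x) @ poly2_features(xp))

# The Gram matrix of a valid kernel is symmetric PSD: all eigenvalues >= 0.
X = rng.normal(size=(20, 3))
G = (X @ X.T) ** 2
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)
```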

RKHSs

Now we can define a Reproducing Kernel Hilbert Space. An RKHS is a Hilbert space of functions where evaluation at a point is a continuous linear operation and this evaluation can be "reproduced" via an inner product with a kernel function.

Formally, let $X$ be a nonempty set and $\mathcal{H}$ be a Hilbert space of functions $f: X \to \mathbb{R}$. Then, $\mathcal{H}$ is a reproducing kernel Hilbert space if there exists a function $K: X \times X \to \mathbb{R}$ so that

1. for every $x \in X$, the function $K(x, \cdot)$ belongs to $\mathcal{H}$, and
2. for every $x \in X$ and every $f \in \mathcal{H}$, $f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}}$ (the reproducing property).

The second property (the reproducing property) is the more interesting one that we really care about. It effectively tells us that evaluating a function is the same as taking an inner product between the function and a specific element of the space, namely $K(x, \cdot)$. Equivalently, it says that the kernel functions serve as "representatives" of point evaluation.
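A small concrete case: for the linear kernel $K(x, x^\prime) = \langle x, x^\prime \rangle$ on $X = \mathbb{R}^d$, the RKHS is the space of linear functions $f_w(x) = \langle w, x \rangle$ with inner product $\langle f_w, f_{w^\prime} \rangle_{\mathcal{H}} = \langle w, w^\prime \rangle$. Here $K(x, \cdot) = f_x$, and the reproducing property is just
$$\langle f_w, K(x, \cdot) \rangle_{\mathcal{H}} = \langle w, x \rangle = f_w(x).$$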

This result allows us to apply tools from Hilbert space theory to function evaluation and learning. Since kernels are defined without reference to a specific feature space, we can build function spaces with nice properties (e.g. smoothness, regularity) by choosing an appropriate kernel.

Representer Theorem

Now, we introduce the Representer Theorem, one of the most important results from statistical learning theory. In a nutshell, it explains why kernel methods are tractable, despite involving very high or infinite dimensional function spaces.

Setup

Let's assume we wish to solve a problem of the form
$$\min_{f \in \mathcal{H}} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \left\| f \right\|^2_\mathcal{H}.$$
Here, $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel $K$, $\{(x_i, y_i)\}_{i=1}^n$ are our training data, $L: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function, and $\lambda > 0$ is a regularization parameter penalizing the complexity of the function $f$.

Statement

Let $\mathcal{H}$ be a reproducing kernel Hilbert space with kernel $K$ and suppose the objective depends on $f$ only via its evaluations at the training points and its RKHS norm. Then, any minimizer $f^* \in \mathcal{H}$ can be represented in the form
$$f^*(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$$
for coefficients $\alpha_1, \dots, \alpha_n \in \mathbb{R}$. Even though $\mathcal{H}$ may be an infinite-dimensional space of functions, the optimal solution lies in the finite-dimensional subspace spanned by the kernel functions centered at the training data: $\{K(x_i, \cdot)\}_{i=1}^n$. Thus, instead of searching over the whole function space, we need only search over the $n$ coefficients $\alpha_1, \dots, \alpha_n$, and we have reduced an infinite-dimensional problem to a finite-dimensional one.
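Concretely, substituting $f = \sum_j \alpha_j K(x_j, \cdot)$ and using the reproducing property, we get $f(x_i) = \sum_j \alpha_j K(x_j, x_i)$ and $\|f\|_{\mathcal{H}}^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) = \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}$, so the optimization becomes
$$\min_{\boldsymbol{\alpha} \in \mathbb{R}^n} \sum_{i=1}^n L\!\left(y_i, \sum_{j=1}^n \alpha_j K(x_j, x_i)\right) + \lambda\, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha},$$
where $K \in \mathbb{R}^{n \times n}$ is the kernel (Gram) matrix with $K_{ij} = K(x_i, x_j)$.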

Proof

Let $\mathcal{H}_0 \doteq \operatorname{span}\{K(x_1, \cdot), \dots, K(x_n, \cdot)\} \subseteq \mathcal{H}$ be a finite-dimensional subspace of $\mathcal{H}$. Since $\mathcal{H}_0$ is finite-dimensional (hence closed), we can uniquely decompose any function $f \in \mathcal{H}$ as
$$f = f_0 + f_\perp,$$
where $f_0 \in \mathcal{H}_0$ and $f_\perp \in \mathcal{H}_0^\perp$.

Recall the reproducing property:
$$f(x_i) = \langle f, K(x_i, \cdot)\rangle_\mathcal{H} = \langle f_0, K(x_i, \cdot)\rangle + \langle f_\perp, K(x_i, \cdot)\rangle,$$
where the second equality follows from the decomposition of $f$ above. Since $f_\perp \in \mathcal{H}_0^\perp$ and $K(x_i, \cdot) \in \mathcal{H}_0$, we have $\langle f_\perp, K(x_i, \cdot)\rangle = 0$, so
$$f(x_i) = \langle f_0, K(x_i, \cdot)\rangle,$$
i.e. the evaluation of $f$ at the data points depends only on $f_0$.

By the Pythagorean identity, we have $\left\| f_0 + f_\perp \right\|_\mathcal{H}^2 = \left\| f_0 \right\|_\mathcal{H}^2 + \left\| f_\perp \right\|_\mathcal{H}^2$, and we can rewrite our objective as
$$\min_{f \in \mathcal{H}} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \left(\left\| f_0 \right\|_\mathcal{H}^2 + \left\| f_\perp \right\|_\mathcal{H}^2\right).$$
Note that the function evaluations at the data points are independent of $f_\perp$, but $f_\perp$ still appears in the regularization term. Since we want to minimize the complexity of the function, it is optimal to set $f_\perp = 0$, and thus the minimizer satisfies $f^* \in \mathcal{H}_0$. Since $\mathcal{H}_0 \doteq \operatorname{span}\{K(x_1, \cdot), \dots, K(x_n, \cdot)\}$, it follows immediately that $f^*$ is a linear combination of $K(x_1, \cdot), \dots, K(x_n, \cdot)$.

Example

Kernel Ridge Regression
Take the squared loss $L(y, f(x)) = (y - f(x))^2$. Let $K \in \mathbb{R}^{n \times n}$ be the kernel matrix with entries $K_{ij} = K(x_i, x_j)$, and let $\mathbf{y} \in \mathbb{R}^n$ be the vector of targets. The objective becomes $\|\mathbf{y} - K\boldsymbol{\alpha}\|^2 + \lambda\, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}$, and setting its gradient to zero gives the closed-form solution
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}.$$
Predictions at a new point $x$ are then $f^*(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$.
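Here's a minimal sketch in NumPy (the RBF kernel, the function names, and the toy data are my own choices for illustration) that fits the coefficients via the formula above and predicts at new points:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq_dists / (2 * sigma**2))

def fit_krr(X, y, lam=1e-2, sigma=1.0):
    """Solve (K + lam*I) alpha = y for the representer coefficients."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, sigma=1.0):
    """f(x) = sum_i alpha_i K(x_i, x), evaluated at each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Toy 1-D regression: fit a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit_krr(X, y)
print(predict_krr(X, alpha, np.array([[0.0], [1.5]])))
```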

Moore-Aronszajn Theorem

So far, we've shown that RKHSs have several nice properties. Now, we show that an RKHS is guaranteed to exist for every symmetric positive semi-definite kernel.

Statement

Let $K: X \times X \to \mathbb{R}$ be an SPSD (symmetric positive semi-definite) function. Then, there exists a unique RKHS $\mathcal{H}_K$ of functions $f: X \to \mathbb{R}$ such that

1. $K(x, \cdot) \in \mathcal{H}_K$ for every $x \in X$, and
2. $f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}_K}$ for every $f \in \mathcal{H}_K$ and $x \in X$.

In other words, we don't need to start with a function space and find a kernel; the kernel itself builds its own Reproducing Kernel Hilbert Space.

Construction

The RKHS $\mathcal{H}_K$ can be explicitly constructed from the kernel $K$ as follows:

RKHS Construction
Let $K : X \times X \to \mathbb{R}$ be a symmetric, positive semi-definite kernel.

1. Define a function space consisting of all finite linear combinations of the kernel evaluated at points in $X$:
$$\mathcal{H}_K := \left\{ f(\cdot) = \sum_{i=1}^n \alpha_i K(x_i, \cdot) \,\middle|\, n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_i \in X \right\}.$$
2. Define an inner product on $\mathcal{H}_K$:
$$\left\langle \sum_i \alpha_i K(x_i, \cdot), \sum_j \beta_j K(x_j, \cdot) \right\rangle_{\mathcal{H}_K} := \sum_{i,j} \alpha_i \beta_j K(x_i, x_j).$$
This makes $\mathcal{H}_K$ a pre-Hilbert space.
3. Complete the space by including all limit points of Cauchy sequences under the norm induced by the inner product.

This guarantees that $K(x, \cdot) \in \mathcal{H}_K$ and that for all $f \in \mathcal{H}_K$,
$$f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}_K},$$
as desired.
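To make the construction tangible, here is a small numeric sketch (the Gaussian kernel and the names below are my own choices): two elements of the pre-Hilbert space are represented by their coefficients and centers, the inner product is computed directly from the definition above, and the reproducing property is checked at a point.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, xp, sigma=1.0):
    """Gaussian kernel on R, playing the role of the SPSD kernel K."""
    return np.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

# f = sum_i alpha_i K(x_i, .) and g = sum_j beta_j K(z_j, .), two elements of H_K.
xs, alphas = rng.normal(size=5), rng.normal(size=5)
zs, betas = rng.normal(size=4), rng.normal(size=4)

def f(x):
    return sum(a * k(xi, x) for a, xi in zip(alphas, xs))

# Inner product from the construction: <f, g> = sum_{i,j} alpha_i beta_j K(x_i, z_j).
inner_fg = alphas @ k(xs[:, None], zs[None, :]) @ betas

# Reproducing property: <f, K(x, .)> = sum_i alpha_i K(x_i, x) = f(x).
x0 = 0.3
assert np.isclose(alphas @ k(xs, x0), f(x0))

# The squared norm <f, f> is nonnegative because K is positive semi-definite.
assert alphas @ k(xs[:, None], xs[None, :]) @ alphas >= 0
print(inner_fg)
```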