\(\require{physics}\)

Reproducing Kernel Hilbert Spaces

Jun 03, 2025

Building up RKHSs from the Riesz Representation Theorem

Motivation

Reproducing Kernel Hilbert Spaces (RKHSs) are a big deal in machine learning, especially in the study of kernel methods and SVMs. RKHSs bring the structure of Hilbert spaces to spaces of functions and give us several useful results in statistical learning theory. We'll first prove the Riesz Representation Theorem, then use it to build up RKHSs and prove the representer theorem, and finally show that every symmetric positive semi-definite kernel has its own RKHS.

The Riesz Representation Theorem (for Hilbert Spaces)

Statement

Let \(\mathcal{H}\) be a Hilbert space. Then for every continuous linear functional \(f \in \mathcal{H}^*\) (the dual space of \(\mathcal{H}\)), there exists a unique element \(v \in \mathcal{H}\) such that \(f(u) = \langle u, v \rangle\) for all \(u \in \mathcal{H}\).

Parsing this statement can be intimidating without some background in linear algebra or functional analysis. Essentially, the RRT states that any function \(f\) that maps a vector \(u \in \mathcal{H}\) to a scalar in a linear and continuous way can always be expressed as the inner product of that vector with some fixed vector \(v \in \mathcal{H}\); moreover, this vector \(v\) is unique and thus completely determines the linear functional \(f\).

Proof

The proof of the RRT is pretty standard and I'm sure it can be found in most linear algebra textbooks (see e.g. Axler's Linear Algebra Done Right), but it is included here for completeness.

Let \(\mathcal{H}\) be a Hilbert space over \(\mathbb{R}\) or \(\mathbb{C}\), and let \(f\) be a continuous linear functional on \(\mathcal{H}\). We wish to show that there exists a unique \(v \in \mathcal{H}\) such that \(f(u) = \langle u, v \rangle\) for all \(u \in \mathcal{H}\).

First, let's consider the trivial case \(f = 0\). In this case, choosing \(v = \vec{0}\) satisfies our requirement and we are done. So, going forward, we assume \(f \neq 0\).

Let \(\ker f = \{u \in \mathcal{H} \mid f(u) = 0\}\). This is a closed subspace of \(\mathcal{H}\) because \(f\) is continuous. Since \(f \neq 0\), we have \(\ker f \neq \mathcal{H}\), so there exists \(w \in \mathcal{H} \setminus \ker f\).

Let \(v = \text{proj}_{\ker f^\perp}(w)\), the orthogonal projection of \(w\) onto the orthogonal complement of \(\ker f\). Then \(v \in \ker f^\perp\) and \(v \neq 0\) since \(w \notin \ker f\). Define \begin{equation} v_0 = \frac{\overline{f(v)}}{\|v\|^2}v. \end{equation}

Fix this \(v\) and note that \(f(v) \neq 0\): since \(v \in \ker f^\perp\), \(v \neq 0\), and \(\ker f \cap \ker f^\perp = \{0\}\), we have \(v \notin \ker f\). Any \(x \in \mathcal{H}\) can then be written as \(x = u + \alpha v\) with \(u \in \ker f\) and \(\alpha\) a scalar: take \(\alpha = f(x)/f(v)\), so that \(u = x - \alpha v\) satisfies \(f(u) = f(x) - \alpha f(v) = 0\) (note that \(\alpha = 0\) precisely when \(x \in \ker f\)). By linearity of \(f\), we have \begin{equation} f(x) = f(u + \alpha v) = f(u) + \alpha f(v) = \alpha f(v). \end{equation} On the other hand, since \(\langle u, v \rangle = 0\) and the inner product is conjugate-linear in its second argument (in the real case there is no conjugation at all), \begin{equation} \langle x, v_0 \rangle = \left\langle u + \alpha v, \frac{\overline{f(v)}}{\|v\|^2}v \right\rangle = \frac{f(v)}{\|v\|^2}\langle u, v \rangle + \alpha \frac{f(v)}{\|v\|^2} \langle v, v \rangle = \alpha f(v). \end{equation} The two expressions agree.

Thus, \(f(x) = \langle x, v_0 \rangle\) for all \(x \in \mathcal{H}\).

To show uniqueness, suppose there exists another vector \(v_1 \in \mathcal{H}\) such that \(f(x) = \langle x, v_1 \rangle\) for all \(x \in \mathcal{H}\). Then, \begin{equation} \langle x, v_0 - v_1 \rangle = \langle x, v_0 \rangle - \langle x, v_1 \rangle = f(x) - f(x) = 0 \end{equation} for all \(x \in \mathcal{H}\). Setting \(x = v_0 - v_1\) gives \begin{equation} \|v_0 - v_1\|^2 = \langle v_0 - v_1, v_0 - v_1 \rangle = 0, \end{equation} which implies \(v_0 = v_1\). Thus, the vector \(v_0\) is unique.
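As a concrete finite dimensional illustration: in \(\mathbb{R}^n\) with the standard inner product, the coordinate functional \(f(u) = u_1\) is represented by \(v_0 = e_1 = (1, 0, \dots, 0)\), since \(f(u) = \langle u, e_1 \rangle\) for every \(u\); more generally, any linear functional \(f(u) = \sum_i a_i u_i\) is represented by \(v_0 = (a_1, \dots, a_n)\).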

Reproducing Kernel Hilbert Spaces

Kernels

Before diving into RKHSs, we must first define a kernel. A kernel acts as a similarity measure between data points. More precisely, a kernel is a function \(K : X \times X \to \mathbb{R}\) (or \(\mathbb{C}\)) such that, for any pair of inputs \(x, x^\prime \in X\), the value \(K(x, x^\prime)\) gives some notion of similarity between \(x\) and \(x^\prime\).
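For example, the Gaussian (RBF) kernel \(K(x, x^\prime) = \exp(-\gamma \norm{x - x^\prime}^2)\) with \(\gamma > 0\) assigns similarity close to \(1\) to nearby points and similarity close to \(0\) to distant ones.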

A kernel must satisfy the following two properties:

1. Symmetry: \(K(x, x^\prime) = K(x^\prime, x)\) for all \(x, x^\prime \in X\) (conjugate symmetry in the complex case).
2. Positive semi-definiteness: for any finite set of points \(x_1, \dots, x_n \in X\) and coefficients \(c_1, \dots, c_n \in \mathbb{R}\), \begin{equation} \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \geq 0. \end{equation}

Kernels can be thought of as implicitly defining a feature space \(\mathcal{F}\) and a mapping \(\phi : X \to \mathcal{F}\) s.t. \begin{equation} K(x, x^\prime) = \langle \phi(x), \phi(x^\prime) \rangle_{\mathcal{F}}. \end{equation} Kernels are useful since we don't need to know \(\phi\) explicitly (often referred to as the kernel trick). Via kernels, we may compute inner products in high or infinite dimensional feature spaces without ever leaving our original input space.
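To make the kernel trick concrete, here is a minimal sketch in Python (the specific kernel, feature map, and points are illustrative choices, not from this post): the degree-2 polynomial kernel on \(\mathbb{R}^2\) agrees with an explicit inner product in a 3-dimensional feature space.

import numpy as np

# Degree-2 polynomial kernel K(x, x') = (<x, x'>)^2 on R^2, which corresponds
# to the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) in R^3.

def poly_kernel(x, xp):
    return np.dot(x, xp) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, xp))   # 1.0, computed entirely in the input space
print(phi(x) @ phi(xp))     # 1.0, same value via the explicit feature map

For kernels like the Gaussian kernel, the corresponding feature space is infinite dimensional, so computing \(\phi\) explicitly is not even possible, yet the kernel value is still a single cheap evaluation.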

RKHSs

Now we can define a Reproducing Kernel Hilbert Space. An RKHS is a Hilbert space of functions where evaluation at a point is a continuous linear operation and this evaluation can be "reproduced" via an inner product with a kernel function.

Formally, let \(X\) be a nonempty set and \(\mathcal{H}\) be a Hilbert space of functions \(f: X \to \mathbb{R}\). Then, \(\mathcal{H}\) is a reproducing kernel Hilbert space if there exists a function \(K: X \times X \to \mathbb{R}\) so that

1. for every \(x \in X\), the function \(K(x, \cdot)\) belongs to \(\mathcal{H}\), and
2. for every \(f \in \mathcal{H}\) and every \(x \in X\), \(f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}}\).

The second property (the reproducing property) is the more interesting one that we really care about. It effectively tells us that evaluating a function is the same as taking an inner product between the function and a specific element of the space (\(K(x, \cdot)\)). Equivalently, it says that the kernel functions serve as "representatives" of point evaluation.

This result allows us to apply tools from Hilbert space theory to function evaluation and learning. Since kernels are defined without reference to a specific feature space, we can build function spaces with nice properties (e.g. smoothness, regularity) by choosing an appropriate kernel.
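As a standard concrete example (not spelled out above): take the linear kernel \(K(x, x^\prime) = \langle x, x^\prime \rangle\) on \(X = \mathbb{R}^d\). The associated RKHS is the space of linear functions \(f_w(x) = \langle w, x \rangle\) with inner product \(\langle f_w, f_v \rangle_\mathcal{H} = \langle w, v \rangle\). Then \(K(x, \cdot) = f_x\), and the reproducing property reads \begin{equation} \langle f_w, K(x, \cdot) \rangle_\mathcal{H} = \langle w, x \rangle = f_w(x). \end{equation}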

Representer Theorem

Now, we introduce the Representer Theorem, one of the most important results from statistical learning theory. In a nutshell, it explains why kernel methods are tractable, despite involving very high or infinite dimensional function spaces.

Setup

Let's assume we wish to solve a problem of the form \begin{equation} \min_{f \in \mathcal{H}} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \norm{f}^2_\mathcal{H}. \end{equation} Here, \(\mathcal{H}\) is a reproducing kernel Hilbert space with kernel \(K\), \(\{(x_i, y_i)\}_{i=1}^n\) are our training data, \(L: \mathbb{R} \times \mathbb{R} \to \mathbb{R}\) is a loss function, and \(\lambda > 0\) is a regularization parameter penalizing the complexity of the function \(f\).

Statement

Let \(\mathcal{H}\) be a reproducing kernel Hilbert space with kernel \(K\) and suppose the objective depends on \(f\) only via its evaluations at the training points and its RKHS norm. Then, any minimizer \(f^* \in \mathcal{H}\) can be represented in the form \begin{equation} f^*(x) = \sum_{i=1}^n \alpha_i K(x_i, x) \end{equation} for coefficients \(\alpha_1, \dots, \alpha_n \in \mathbb{R}\). Even though \(\mathcal{H}\) may be an infinite dimensional space of functions, the optimal solution lies in the finite-dimensional subspace spanned by the kernel functions centered at the training data: \(\{K(x_i, \cdot)\}_{i=1}^n\). Thus, instead of searching over the whole function space, we need only search over the \(n\) coefficients \(\alpha_1, \dots, \alpha_n\), and we have reduced an infinite dimensional problem to a finite dimensional one.

Proof

Let \(\mathcal{H}_0 \doteq \operatorname{span}\{K(x_1, \cdot), \dots, K(x_n, \cdot)\} \subseteq \mathcal{H}\) be the finite dimensional subspace of \(\mathcal{H}\) spanned by the kernel functions at the training points. Since \(\mathcal{H}_0\) is finite dimensional, it is a closed subspace of the Hilbert space \(\mathcal{H}\), so we can uniquely decompose any function \(f \in \mathcal{H}\) as \begin{equation} f = f_0 + f_\perp, \end{equation} where \(f_0 \in \mathcal{H}_0\) and \(f_\perp \in \mathcal{H}_0^\perp\).

Recall the reproducing property: \begin{equation} f(x_i) = \langle f, K(x_i, \cdot)\rangle_\mathcal{H} = \langle f_0, K(x_i, \cdot)\rangle + \langle f_\perp, K(x_i, \cdot)\rangle \end{equation} where the second equality follows from the decomposition of \(f\) above. Since \(f_\perp \in \mathcal{H}_0^\perp\) and \(K(x_i, \cdot) \in \mathcal{H}_0\), we have \(\langle f_\perp, K(x_i, \cdot)\rangle = 0\), so \begin{equation} f(x_i) = \langle f_0, K(x_i, \cdot)\rangle, \end{equation} i.e. the evaluation of \(f\) at the data points depends only on \(f_0\).

By the Pythagorean identity, we have \(\norm{f_0 + f_\perp}_\mathcal{H}^2 = \norm{f_0}_\mathcal{H}^2 + \norm{f_\perp}_\mathcal{H}^2\), so we can rewrite our objective function as \begin{equation} \min_{f \in \mathcal{H}} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda (\norm{f_0}_\mathcal{H}^2 + \norm{f_\perp}_\mathcal{H}^2). \end{equation} The function evaluations at the data points are independent of \(f_\perp\), but \(f_\perp\) still appears in the regularization term. Since setting \(f_\perp = 0\) leaves the loss terms unchanged and strictly decreases the regularizer whenever \(f_\perp \neq 0\), any minimizer \(f^*\) must have \(f_\perp = 0\), i.e. \(f^* \in \mathcal{H}_0\). Since \(\mathcal{H}_0 \doteq \operatorname{span}\{K(x_1, \cdot), \dots, K(x_n, \cdot)\}\), it follows immediately that \(f^*\) is a linear combination of \(K(x_1, \cdot), \dots, K(x_n, \cdot)\).
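Concretely, if we substitute \(f = \sum_{i=1}^n \alpha_i K(x_i, \cdot)\), the reproducing property gives \(f(x_j) = \sum_i \alpha_i K(x_i, x_j)\) and \(\norm{f}_\mathcal{H}^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)\), so with the kernel matrix \(K_{ij} = K(x_i, x_j)\) the optimization becomes the finite dimensional problem \begin{equation} \min_{\boldsymbol{\alpha} \in \mathbb{R}^n} \sum_{j=1}^n L\big(y_j, (K\boldsymbol{\alpha})_j\big) + \lambda\, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}. \end{equation}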

Example

Kernel Ridge Regression
Take the squared loss \(L(y, f(x)) = (y - f(x))^2\). With the kernel matrix \(K \in \mathbb{R}^{n \times n}\), \(K_{ij} = K(x_i, x_j)\), and the vector of targets \(\mathbf{y} \in \mathbb{R}^n\), the finite dimensional problem above becomes \begin{equation} \min_{\boldsymbol{\alpha} \in \mathbb{R}^n} \norm{\mathbf{y} - K\boldsymbol{\alpha}}^2 + \lambda\, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha}, \end{equation} which is minimized by \begin{equation} \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}. \end{equation} The fitted function is then \(f^*(x) = \sum_{i=1}^n \alpha_i K(x_i, x)\).
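As an illustrative sketch (the data, kernel bandwidth, and regularization value below are arbitrary choices for demonstration, not from this post), kernel ridge regression with a Gaussian kernel fits in a few lines of Python:

import numpy as np

# Kernel ridge regression with a Gaussian (RBF) kernel.
# Training data, gamma, and lam are arbitrary illustrative choices.

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))                  # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)       # noisy targets

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)      # alpha = (K + lam I)^{-1} y

X_new = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
f_new = rbf_kernel(X_new, X) @ alpha                      # f*(x) = sum_i alpha_i K(x_i, x)
print(f_new)

Each prediction is a weighted combination of kernel evaluations against the training points, exactly the form guaranteed by the representer theorem.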

Moore-Aronszajn Theorem

So far, we've shown that RKHSs have several nice properties. Now, we show that an RKHS is guaranteed to exist for any symmetric positive semi-definite kernel.

Statement

Let \(K: X \times X \to \mathbb{R}\) be an SPSD (symmetric positive semi-definite) function. Then, there exists a unique RKHS \(\mathcal{H}_K\) of functions \(f: X \to \mathbb{R}\) such that

1. \(K(x, \cdot) \in \mathcal{H}_K\) for every \(x \in X\), and
2. \(f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}_K}\) for every \(f \in \mathcal{H}_K\) and \(x \in X\) (the reproducing property).

In other words, we don't need to start with a function space and then find a kernel for it; the kernel itself builds its own Reproducing Kernel Hilbert Space.

Construction

The RKHS \(\mathcal{H}_K\) can be explicitly constructed from the kernel \(K\) as follows:

RKHS Construction
Let \(K : X \times X \to \mathbb{R}\) be a symmetric, positive semi-definite kernel.

1. Define a function space consisting of all finite linear combinations of the kernel evaluated at points in \(X\): \begin{equation} \mathcal{H}_K := \left\{ f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) \,\middle|\, \alpha_i \in \mathbb{R},\, x_i \in X \right\}. \end{equation}
2. Define an inner product on \(\mathcal{H}_K\): \begin{equation} \left\langle \sum_i \alpha_i K(x_i, \cdot), \sum_j \beta_j K(x_j, \cdot) \right\rangle_{\mathcal{H}_K} := \sum_{i,j} \alpha_i \beta_j K(x_i, x_j). \end{equation} This makes \(\mathcal{H}_K\) a pre-Hilbert space.
3. Complete the space by including all limit points of Cauchy sequences under the norm induced by the inner product.

This construction guarantees that \(K(x, \cdot) \in \mathcal{H}_K\) and that for all \(f \in \mathcal{H}_K\), \begin{equation} f(x) = \langle f, K(x, \cdot) \rangle_{\mathcal{H}_K} \end{equation} as desired.
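To see why the reproducing property holds, take \(f = \sum_i \alpha_i K(x_i, \cdot)\) and pair it with the single-term element \(K(x, \cdot)\) using the inner product defined above: \begin{equation} \left\langle \sum_i \alpha_i K(x_i, \cdot),\, K(x, \cdot) \right\rangle_{\mathcal{H}_K} = \sum_i \alpha_i K(x_i, x) = f(x). \end{equation} The property then extends from the dense subspace of finite combinations to the completion by continuity of the inner product.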