Gaussian Processes

Gaussian Processes#

Example Problem: Regression#

To make the core ideas here concrete, we consider the problem of trying to learn about a function \(y = f(x)\) where \(f: \mathbb{R}\rightarrow \mathbb{R}\) given some data \(D=\{(x_i, y_i)\} = (\vect{x}, \vect{y})\) distributed as a multivariate Gaussian distribution \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\) with probability distribution function (PDF)

\[\begin{gather*} p_{\N(\vect{\mu}, \mat{C})}(\vect{y}) = \frac{\exp\Bigl(-\tfrac{1}{2}(\vect{y}-\vect{\mu})^\T\mat{C}^{-1}(\vect{y}-\vect{\mu})\Bigr)} {\sqrt{\det\abs{2\pi\mat{C}}}}, \\ [\vect{\mu}]_i = \braket{[\vect{y}]_i}, \qquad [\mat{C}]_{ij} = \kappa(x_i, x_j). \end{gather*}\]

The following provide alternative, but ultimately equivalent, viewpoints of this problem:

Linear least-squares fitting. The idea here is to use a set of basis functions \(\phi_n(x)\) and express:

\[\begin{gather*} f_{\vect{a}}(x) = \sum_{n} a_n \phi_n(x). \end{gather*}\]

Using Bayes’ theorem, if we have a prior for the parameters \(\vect{a} \sim p(\vect{a})\), then our posterior is:

\[\begin{gather*} p(\vect{a}|D) \propto \overbrace{\mathcal{L}(\vect{a}|D)}^{p(D|\vect{a})}p(\vect{a}). \end{gather*}\]

Here the likelihood function \(\mathcal{L}(\vect{a}|D) = p(D|\vect{a})\) is the probability of obtaining the data \(D\) if the true parameter values were \(\vect{a}\). Given our assumption above that \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\), we have:

\[\begin{gather*} p(D|\vect{a}) = p_{\N(\vect{\mu}, \mat{C})}\Bigl(f_{\vect{a}}(\vect{x})\Bigr). \end{gather*}\]

If the priors are also distributed as a multivariate Gaussian distribution \(\vect{a} \sim \N(\vect{\mu}_a, \mat{C}_a)\), then we can compute the posterior:

\[\begin{gather*} p(\vect{a}|D) \propto \overbrace{\mathcal{L}(\vect{a}|D)}^{p(D|\vect{a})}p(\vect{a}). \end{gather*}\]
Gaussian Process.

Gaussian Processes#

In simple terms, a Gaussian process is just the generalization of a multivariate Gaussian distribution to infinite-dimensional spaces such as the space of functions. To say that a function \(f: \mathbb{R}\mapsto \mathbb{R}\) is distributed by a Gaussian process

\[\begin{gather*} f \sim \GP[m(x), \kappa(x, x')] \end{gather*}\]

means that, if \(f(x)\) is sampled at any finite set of points \(\vect{x}\) with \(y_i = f(x_i)\), then \(\vect{y}\) is distributed as multivariate Gaussian distribution \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\) with probability distribution function (PDF)

\[\begin{gather*} p_{\N(\vect{\mu}, \mat{C})}(\vect{y}) = \frac{\exp\Bigl(-\tfrac{1}{2}(\vect{y}-\vect{\mu})^\T\mat{C}^{-1}(\vect{y}-\vect{\mu})\Bigr)} {\sqrt{\det\abs{2\pi\mat{C}}}}, \\ [\vect{\mu}]_i = \braket{[\vect{y}]_i}, \qquad [\mat{C}]_{ij} = \kappa(x_i, x_j). \end{gather*}\]

I.e. \(\vect{\mu}\) is the vector of mean values and \(\mat{C}\) is the covariance matrix. The functions \(m(x)\) and \(\kappa(x, x')\) are simply the infinite-dimensional 1representations of these in the continuum limit.

Regression#

Gaussian processes can be used to perform a generalized linear regression for the function \(y=f(x)\). Consider the aforementioned tabulation \(D=\{(x_i, y_i)\}\) to be “data” from some experiment with correlated Gaussian errors \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\). We can use Bayes’ theorem to update some prior distribution \(f\sim p(f)\):

\[\begin{gather*} p(f|D) \propto \mathcal{L}(f|D)p(f), \qquad \mathcal{L}(f|D) = p(D|f) = p_{\N(\vect{\mu}, \mat{C})}\bigl(f(\vect{x})\bigr). \end{gather*}\]

If the prior is Gaussian process \(f \sim \GP[m(x), \kappa(x, x')]\), then we can compute the posterior analytically.

Details

To see how this works, consider a function \(y_n = f(x_n)\) tabulated at a set of \(N\) points \(x_{n \in \{0, 1, \dots N-1\}}\) with a Gaussian prior \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\). Now consider data \(D\) consisting of a few points \(x_{i \in \{0, 1, M-1\}}\) with experimental errors \(\vect{y}_D \sim \N(\vect{\mu}_D, \mat{C}_D)\). Note, we have not specified which points are where, so we have organized the abscissa so that the experimental data are the first \(M\) points, which need not be adjacent.

We can now express \(\mat{C}\) is block diagonal form with the first block corresponding to the indices where we have data:

\[\begin{gather*} \mat{C} = \begin{pmatrix} \mat{C}_{00} & \mat{C}_{01}\\ \mat{C}_{10} & \mat{C}_{11}. \end{pmatrix} \end{gather*}\]

Organize the points so that the first Consider

Background#

Properties of Gaussian Distributions#

Here is a quick summary of the relevant properties of multivariate Gaussian distributions. See also Random Variables for additional details.

PDF#

The probability distribution function (PDF) for a vector \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\) of variables \(\{y_i\}\) distributed as a multivariate Gaussian distribution (or multivariate normal distribution) is:

\[\begin{gather*} p_{\N(\vect{\mu}, \mat{C})}(\vect{y}) = \frac{\exp\Bigl(-\tfrac{1}{2}(\vect{y}-\vect{\mu})^\T\mat{C}^{-1}(\vect{y}-\vect{\mu})\Bigr)} {\sqrt{\det\abs{2\pi\mat{C}}}}. \end{gather*}\]

Here \(\vect{\mu}\) is the vector of mean values, and \(\mat{C}\) is the covariance matrix.

Product of PDFs#

The product of the PDFs of multivariate Gaussian distributions with symmetric covariances \(\mat{C} = \mat{C}^\dagger\) is a multivariate Gaussian distribution (but not normalized):

\[\begin{align*} p_{\N(\vect{\mu}_c, \mat{C}_c)}(\vect{y}) & \propto p_{\N(\vect{\mu}_a, \mat{C}_a)}(\vect{y})p_{\N(\vect{\mu}_b, \mat{C}_b)}(\vect{y}), \\ \mat{C}_{c}^{-1} &= \mat{C}_{a}^{-1} + \mat{C}_{b}^{-1}, \\ \mat{C}_{c}^{-1}\vect{\mu}_c &= \mat{C}_{a}^{-1}\vect{\mu}_a + \mat{C}_{b}^{-1}\vect{\mu}_b. \end{align*}\]

I.e. the sums of the inverse covariance matrices add.

Marginal Distribution#

Partition a multivariate Gaussian distribution \(\vect{y} \sim \N_{\vect{y}}(\vect{\mu},\mat{C})\) as

\[\begin{align*} \vect{y} &= \begin{pmatrix} \vect{y}_a\\ \vect{y}_b\\ \end{pmatrix}, & \vect{\mu} &= \begin{pmatrix} \vect{\mu}_a\\ \vect{\mu}_b \end{pmatrix}, & \mat{C} &= \begin{pmatrix} \mat{C}_a & \mat{C}_{ab}\\ \mat{C}_{ba} & \mat{C}_b \end{pmatrix}, \end{align*}\]

where \(\mat{C}_{ba} = \mat{C}_{ab}^\dagger\) for standard distributions \(\mat{C} = \mat{C}^{\dagger}\). The marginal distributions for each partition (integrating over the other variables) are:

\[\begin{align*} \vect{y}_a &\sim \N_{\vect{y}_a}(\vect{\mu}_a, \mat{C}_{a}), \\ \vect{y}_b &\sim \N_{\vect{y}_b}(\vect{\mu}_b, \mat{C}_{b}). \end{align*}\]

I.e. the conditional covariance matrices are just the corresponding sub-blocks of the full covariance matrix.

Conditional Distributions#

Partition a multivariate Gaussian distribution \(\vect{y} \sim \N_{\vect{y}}(\vect{\mu},\mat{C})\) as

\[\begin{gather*} \mat{C}^{-1} = \begin{pmatrix} (\mat{C} / \mat{C}_{a})^{-1} & -(\mat{C}/\mat{C}_{b})^{-1}\mat{C}_{ab}\mat{C}_{b}^{-1}\\ -(\mat{C}/\mat{C}_{a})^{-1}\mat{C}_{ba}\mat{C}_{a}^{-1} & (\mat{C} / \mat{C}_{b})^{-1} \end{pmatrix}. \end{gather*}\]

where we have partitioned the inverse covariance here \(\mat{C}^{-1}\) using the Schur complements \(\mat{C} / \mat{C}_{a}\) and \(\mat{C} / \mat{C}_{b}\) of the block matrix \(\mat{C}\):

\[\begin{align*} \mat{C} / \mat{C}_{a} &\equiv \mat{C}_{b} - \mat{C}_{ba}\mat{C}_{a}^{-1}\mat{C}_{ab}, \\ \mat{C} / \mat{C}_{b} &\equiv \mat{C}_{a} - \mat{C}_{ab}\mat{C}_{b}^{-1}\mat{C}_{ba}. \end{align*}. :::{margin} Recall that the conditional distribution is obtained by fixing other variables: \begin{gather*} p(\vect{y}_a|\vect{y}_b) \propto p(\vect{y}). \end{gather*} This must be renormalized. ::: The conditional distributions (holding the other variables fixed) are: \begin{align*} \vect{y}_a &\sim \N_{\vect{y}_a}\Bigl( \vect{\mu}_a + \mat{C}_{ab}\mat{C}_b^{-1}(\vect{y}_b - \vect{\mu}_b),\; \mat{C} / \mat{C}_{a}\Bigr), \\ \vect{y}_b &\sim \N_{\vect{y}_b}\Bigl( \vect{\mu}_b + \mat{C}_{ba}\mat{C}_{a}^{-1}(\vect{y}_a - \vect{\mu}_a),\; \mat{C} / \mat{C}_{b}\Bigr). \end{align*}\]

I.e. the conditional inverse covariance matrices are just the corresponding sub-blocks of the full inverse covariance matrix.

Linear Combinations#

If \(\vect{x} \sim \N(\vect{\mu}_x, \mat{C}_x)\) and \(\vect{y} \sim \N(\vect{\mu}_y, \mat{C}_y)\) are independent, then \(\vect{z} = \mat{X}\vect{x} + \mat{Y}\vect{y}\) is distributed as:

\[\begin{gather*} \vect{z} = \mat{X}\vect{x} + \mat{Y}\vect{y} \sim \N\Bigl( \mat{X}\vect{\mu}_x + \mat{Y}\vect{\mu}_y,\; \mat{X}\mat{C}_x\mat{X}^\dagger + \mat{Y}\mat{C}_y\mat{Y}^\dagger \Bigr). \end{gather*}\]

Principle Components, Bases, and Metrics#

Consider a multivariate Gaussian distribution normal distribution \(\vect{y} \sim \N(\vect{\mu}, \mat{C})\). If we diagonalize the covariance matrix \(\mat{C} = \mat{U}\mat{D}\mat{U}^{\dagger}\):

\[\begin{gather*} \mat{C} = \overbrace{ \begin{pmatrix} \uvect{u}_0 & \uvect{u}_1 & \cdots \end{pmatrix} }^{\mat{U}} \overbrace{ \begin{pmatrix} \sigma_0^2 \\ & \sigma_1^2\\ & & \ddots \end{pmatrix}}^{\mat{D}} \overbrace{ \begin{pmatrix} \uvect{u}_0^\dagger \\ \uvect{u}_1^\dagger \\ \vdots \end{pmatrix}}^{\mat{U}^\dagger} = \begin{pmatrix} \smash{\overbrace{\uvect{u}_0\sigma_0}^{\vect{u}_0}} & \smash{\overbrace{\uvect{u}_1\sigma_1}^{\vect{u}_1}} & \cdots \end{pmatrix} \begin{pmatrix} \vect{u}_0^\dagger \\ \vect{u}_1^\dagger \\ \vdots \end{pmatrix} \end{gather*}\]

then the columns of \(\mat{U} = [\uvect{u}_0 \; \uvect{u}_1\; \cdots]\) form an orthonormal basis that we can rescale into a set of orthogonal vectors \(\vect{u}_n = \sigma_n\uvect{u}_{n}\) such that

\[\begin{gather*} \vect{y} = \vect{\mu} + \sum_{n} \xi_n\vect{u}_n, \qquad \xi_n \sim \N(0, 1) \end{gather*}\]

where \(\xi_n\) are independent random variables with a standard normal distribution (zero mean and unit covariance). The vectors \(\vect{u}_n\) or the pairs \((\uvect{u}_{n}, \sigma_n)\) are called the principle components. Note that this also follows from the property above of linear combinations of random variables (expressed as a sum of dyads):

\[\begin{gather*} \mat{C} = \sum_{n}\vect{u}_n\vect{u}_n^\dagger. \end{gather*}\]

A somewhat surprising result is that, if we relax the requirement that the vectors \(\vect{u}_n\) be orthogonal, then there are many similar decompositions of \(\vect{y}\) as a linear combination of independent standard normal variables:

\[\begin{gather*} \mat{C} = \sum_{n}\vect{a}_n\vect{a}_n^\dagger = \mat{A}\mat{A}^\dagger, \qquad \mat{A} = \begin{pmatrix} \vect{a}_0 & \vect{a}_1 & \cdots \end{pmatrix}, \end{gather*}\]

where the vectors \(\vect{a}_n\) are linearly independent, but need not be orthogonal.

To see this, construct the singular value decomposition (SVD) for \(\mat{A}\):

\[\begin{gather*} \mat{A} = \mat{U}\sqrt{\mat{D}}\mat{V}^\dagger, \qquad \mat{C} = \mat{A}\mat{A}^\dagger = \mat{U}\mat{D}\mat{U}^\dagger. \end{gather*}\]

If we are careful and arrange the singular values appropriately, then \(\mat{U}\) and \(\mat{D}\) here are the same as in the diagonalization of \(\mat{C}\), but the unitary matrix \(\mat{V}\) is arbitrary. Here we demonstrate numerically in 2D:

../_images/531c739c28aac41299375a90f8116baa48fef7a2fa29c244745bd31d241c32d0.png

This figure shows contours from a bivariate normal distribution with orthogonal principle components \(\vect{u}_0\) and \(\vect{u}_1\) shown as black vectors. Several equivalent independent but non-orthogonal basis vectors \(\vect{a}_0\) and \(\vect{a}_1\) are shown in different colors. These are generated by a one-dimensional parameterization of the unitary matrix \(\mat{V}(\theta)\). In this case, we can choose any direction \(\uvect{a}_0\), but must construct \(\vect{a}_1\) carefully.

References#

[Gelman et al., 2013]: A good comprehensive reference. Available for non-commercial from the author’s webpage.
[Melendez et al., 2019]: An application to nuclear theory with associated code gsum.
[Rasmussen and Williams, 2006]: Another nice book that formed the basis for the implementations in scikit-learn.

Gaussian Processes

Contents

Gaussian Processes#

Example Problem: Regression#

Gaussian Processes#

Regression#

Background#

Properties of Gaussian Distributions#

PDF#

Product of PDFs#

Marginal Distribution#

Conditional Distributions#

Linear Combinations#

Principle Components, Bases, and Metrics#

References#