Theory
This page covers the mathematical background for laGP.jl.
Gaussian Process Basics
A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully specified by its mean function $m(x)$ and covariance function $k(x, x')$:
\[f(x) \sim \mathcal{GP}(m(x), k(x, x'))\]
Given training data $\{X, Z\}$, where $X$ is an $n \times m$ design matrix and $Z$ is an $n$-vector of responses, the posterior of a zero-mean GP at test points $X_*$ is:
\[\begin{aligned} \mu_* &= k(X_*, X) K^{-1} Z \\ \Sigma_* &= k(X_*, X_*) - k(X_*, X) K^{-1} k(X, X_*) \end{aligned}\]
where $K = k(X, X) + g I$ is the training covariance matrix with nugget $g$.
Kernel Parameterization
Isotropic Squared Exponential
laGP uses the following parameterization:
\[k(x, y) = \exp\left(-\frac{\|x - y\|^2}{d}\right)\]
where $d$ is the lengthscale parameter.
KernelFunctions.jl instead uses $k(x,y) = \exp(-\|x-y\|^2/(2\ell^2))$.
The two parameterizations are therefore related by $d = 2\ell^2$, i.e. $\ell = \sqrt{d/2}$.
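A quick numerical check of this mapping, assuming KernelFunctions.jl's exported `SqExponentialKernel` and `with_lengthscale`:

```julia
using KernelFunctions

d = 0.5                        # laGP-style lengthscale
ℓ = sqrt(d / 2)                # equivalent KernelFunctions.jl lengthscale

k_lagp(x, y) = exp(-sum(abs2, x .- y) / d)
k_kf = with_lengthscale(SqExponentialKernel(), ℓ)

x, y = [0.1, 0.2], [0.4, 0.9]
k_lagp(x, y) ≈ k_kf(x, y)      # true
```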
Separable (ARD) Kernel
For separable/anisotropic GPs, each dimension has its own lengthscale:
\[k(x, y) = \exp\left(-\sum_{j=1}^{m} \frac{(x_j - y_j)^2}{d_j}\right)\]
This allows the model to capture different smoothness in different input directions.
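A direct transcription in Julia (the name `k_ard` is illustrative):

```julia
# Separable (ARD) kernel: one lengthscale dⱼ per input dimension.
k_ard(x, y, ds) = exp(-sum((x .- y) .^ 2 ./ ds))

k_ard([0.1, 0.2], [0.3, 0.7], [0.5, 2.0])   # different dⱼ per dimension
```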
Concentrated Likelihood
laGP uses the concentrated (profile) log-likelihood:
\[\ell = -\frac{1}{2}\left(n \log\left(\frac{\phi}{2}\right) + \log|K|\right)\]
where $\phi = Z^\top K^{-1} Z$.
This formulation profiles out the variance parameter, leaving only $d$ and $g$ to optimize.
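A minimal sketch of this formula for the isotropic kernel; `concentrated_llik` is an illustrative helper, not laGP.jl's API:

```julia
using LinearAlgebra

# Concentrated log-likelihood ℓ = -0.5 (n log(ϕ/2) + log|K|).
function concentrated_llik(X, Z, d, g)
    n = length(Z)
    K = [exp(-sum(abs2, xi .- xj) / d) for xi in eachrow(X), xj in eachrow(X)] + g * I
    F = cholesky(Symmetric(K))
    ϕ = dot(Z, F \ Z)                   # ϕ = Zᵀ K⁻¹ Z
    return -0.5 * (n * log(ϕ / 2) + logdet(F))
end
```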
Gradients
The gradient with respect to the nugget $g$:
\[\frac{\partial \ell}{\partial g} = -\frac{1}{2} \text{tr}(K^{-1}) + \frac{n}{2\phi} (K^{-1}Z)^\top (K^{-1}Z)\]
The gradient with respect to lengthscale $d$:
\[\frac{\partial \ell}{\partial d} = -\frac{1}{2} \text{tr}\left(K^{-1} \frac{\partial K}{\partial d}\right) + \frac{n}{2\phi} Z^\top K^{-1} \frac{\partial K}{\partial d} K^{-1} Z\]
where for the isotropic kernel:
\[\frac{\partial K_{ij}}{\partial d} = K_{ij} \frac{\|x_i - x_j\|^2}{d^2}\]
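The two gradients transcribed directly, again for the isotropic kernel. This is an illustrative helper; forming $K^{-1}$ explicitly is acceptable at the small local-design sizes used later:

```julia
using LinearAlgebra

function llik_grad(X, Z, d, g)
    n = length(Z)
    D = [sum(abs2, xi .- xj) for xi in eachrow(X), xj in eachrow(X)]
    K = exp.(-D ./ d) + g * I
    Ki = inv(cholesky(Symmetric(K)))    # explicit K⁻¹
    KiZ = Ki * Z
    ϕ = dot(Z, KiZ)
    dKdd = exp.(-D ./ d) .* D ./ d^2    # ∂K/∂d; the nugget term drops out
    ∂ℓ∂d = -0.5 * tr(Ki * dKdd) + n / (2 * ϕ) * dot(KiZ, dKdd * KiZ)
    ∂ℓ∂g = -0.5 * tr(Ki) + n / (2 * ϕ) * dot(KiZ, KiZ)
    return ∂ℓ∂d, ∂ℓ∂g
end
```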
Inverse-Gamma Priors
Hyperparameter optimization regularizes the likelihood with Inverse-Gamma priors, so the resulting estimates are maximum a posteriori (MAP) rather than pure MLE:
\[p(\theta) \propto \theta^{-a-1} \exp\left(-\frac{b}{\theta}\right)\]
where $a$ is the shape and $b$ is the scale parameter.
The log-prior and its gradient:
\[\begin{aligned} \log p(\theta) &= a \log b - \log\Gamma(a) - (a+1)\log\theta - \frac{b}{\theta} \\ \frac{\partial \log p}{\partial \theta} &= -\frac{a+1}{\theta} + \frac{b}{\theta^2} \end{aligned}\]
Default values: $a = 3/2$, with $b$ computed from the data-adaptive parameter ranges.
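These two expressions as one-liners, using SpecialFunctions.jl for $\log\Gamma$ (names illustrative):

```julia
using SpecialFunctions: loggamma

logprior(θ, a, b)  = a * log(b) - loggamma(a) - (a + 1) * log(θ) - b / θ
dlogprior(θ, a, b) = -(a + 1) / θ + b / θ^2

logprior(0.5, 3/2, 1.0), dlogprior(0.5, 3/2, 1.0)
```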
Local Approximate GP
For large datasets, building a full GP is computationally prohibitive: each fit requires an $O(n^3)$ matrix decomposition. The local approximate GP instead builds a small local model for each prediction point.
Algorithm
For each prediction point $x_*$ (a sketch of this loop follows the list):
- Initialize: select the $n_0$ nearest neighbors to $x_*$
- Iterate: until the local design has $n_{\text{end}}$ points:
  - Evaluate the acquisition criterion on all remaining candidates
  - Add the point that maximizes (ALC) or minimizes (MSPE) the criterion
- Predict: make the prediction using the local GP
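A skeleton of this loop, assuming a generic `criterion(i, local_idx)` that scores candidate row `i` of `X` given the current local design; all names here are illustrative, not laGP.jl's API:

```julia
function local_design(X, xstar, n0, nend, criterion; maximize = true)
    dists = [sum(abs2, xi .- xstar) for xi in eachrow(X)]
    local_idx  = collect(partialsortperm(dists, 1:n0))   # n0 nearest neighbors
    candidates = setdiff(1:size(X, 1), local_idx)
    while length(local_idx) < nend
        scores = [criterion(i, local_idx) for i in candidates]
        best = maximize ? argmax(scores) : argmin(scores)
        push!(local_idx, candidates[best])
        deleteat!(candidates, best)
    end
    return local_idx                                     # indices of the local design
end
```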
Active Learning Cohn (ALC)
ALC measures the expected reduction in predictive variance at reference points if a candidate point were added:
\[\text{ALC}(x) = \frac{1}{n_{\text{ref}}} \sum_{j=1}^{n_{\text{ref}}} \left[\sigma^2(x_{\text{ref},j}) - \sigma^2_{x}(x_{\text{ref},j})\right]\]
where $\sigma^2_x$ is the variance after adding point $x$ to the design.
Higher ALC values indicate points that would reduce prediction uncertainty more.
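A direct (unoptimized) sketch of the definition: refit and compare variances at the reference points. laGP itself uses closed-form rank-one updates rather than refitting, and `posterior_var(Xloc, xref)` is an assumed helper returning the local GP's predictive variance:

```julia
# Average variance reduction at the reference points if x were added.
function alc(x, Xloc, Xref, posterior_var)
    before = [posterior_var(Xloc, xr) for xr in eachrow(Xref)]
    after  = [posterior_var(vcat(Xloc, x'), xr) for xr in eachrow(Xref)]
    return sum(before .- after) / size(Xref, 1)
end
```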
Mean Squared Prediction Error (MSPE)
MSPE combines current variance with expected variance reduction:
\[\text{MSPE}(x) = \frac{n+1}{n-1} \bar{\sigma}^2 - \frac{n+1}{n-1} \cdot \frac{n-2}{n} \cdot \text{ALC}(x)\]
where $\bar{\sigma}^2$ is the average predictive variance at reference points.
Lower MSPE values indicate better candidate points.
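As a one-liner (names illustrative):

```julia
# MSPE from the average reference variance sbar2 and ALC(x).
mspe(n, sbar2, alcx) = (n + 1) / (n - 1) * sbar2 - (n + 1) / (n - 1) * (n - 2) / n * alcx
```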
Nearest Neighbor (NN)
The simplest approach: just use the $n_{\text{end}}$ nearest neighbors. Fast, but a purely distance-based design ignores the predictive-variance information that ALC and MSPE exploit, and tends to give poorer predictions.
Prediction Variance Scaling
laGP scales the posterior variance using the Student-t distribution:
\[s^2 = \frac{\phi}{n} \left(1 + g - k_*^\top K^{-1} k_*\right)\]
This gives wider intervals than the plug-in Gaussian formulation: integrating out the variance (scale) parameter yields a Student-t predictive distribution with $n$ degrees of freedom, where $n$ is the number of training points.
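A sketch of a central 95% predictive interval under this scaling, using Distributions.jl's `TDist` (the helper name and signature are illustrative):

```julia
using Distributions

# kKk = k*ᵀ K⁻¹ k* for a single test point with predictive mean μ.
function predict_interval(μ, ϕ, n, g, kKk)
    s2 = ϕ / n * (1 + g - kKk)          # scaled predictive variance
    q  = quantile(TDist(n), 0.975)      # Student-t with n degrees of freedom
    return μ - q * sqrt(s2), μ + q * sqrt(s2)
end
```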
Data-Adaptive Hyperparameter Ranges
Lengthscale (darg)
The `darg` function computes ranges from pairwise squared distances (sketched below):
- Start: 10th percentile of the pairwise squared distances
- Min: half the minimum non-zero squared distance, floored at $\sqrt{\epsilon}$
- Max: the maximum pairwise squared distance
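A sketch of these rules (it mirrors the list above, not the exact `darg` implementation):

```julia
using Statistics: quantile

function darg_ranges(X)
    n = size(X, 1)
    D = [sum(abs2, X[i, :] .- X[j, :]) for i in 1:n for j in (i+1):n]
    (start = quantile(D, 0.10),
     min   = max(minimum(filter(>(0), D)) / 2, sqrt(eps())),
     max   = maximum(D))
end
```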
Nugget (garg)
The `garg` function computes ranges from response variability (sketched below):
- Start: 2.5th percentile of the squared residuals from the mean of $Z$
- Min: $\sqrt{\epsilon}$, where $\epsilon$ is machine epsilon
- Max: the maximum squared residual
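And the corresponding sketch for the nugget (again illustrative, not the exact `garg` implementation):

```julia
using Statistics: mean, quantile

function garg_ranges(Z)
    r2 = (Z .- mean(Z)) .^ 2           # squared residuals from the mean
    (start = quantile(r2, 0.025), min = sqrt(eps()), max = maximum(r2))
end
```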
References
Gramacy, R. B. (2016). laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R. Journal of Statistical Software, 72(1), 1-46.
Gramacy, R. B. and Apley, D. W. (2015). Local Gaussian Process Approximation for Large Computer Experiments. Journal of Computational and Graphical Statistics, 24(2), 561-578.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.