Sparse GPs: approximate the posterior, not the model

Lots of people like to use Gaussian process (GP) models. Even the most expert deep learning researchers will use a GP for small numbers of data points (Bengio on Quora).

Date:

July 26, 2016

Sparse GPs: approximate the posterior, not the model.

Lots of people like to use Gaussian process (GP) models. Even the most expert deep learning researchers will use a GP for small numbers of data points (Bengio on Quora). For larger datasets, researchers have spent years trying to make reasonable approximations so that they can use GPs in those settings too: trading off some accuracy for computability. A question I often get asked is: which GP approximation should I use -- there are so many!? This post attempts to describe a little history of how some approximations are related, and describe why I prefer to use the variational GP approximation.

Sparseness, pseudo-inputs and inducing points

This area of research is often covered by the catch-all term ‘sparse GPs’. The term stems from some of the original work in this area, where methods used a subset of the dataset to approximate the kernel matrix (Williams and Seeger 2000). Sparseness occurred through selection of the datapoints. So, given a dataset of inputs X = {x_i}_{i=1}^N , x_i \in \mathcal X, and responses (targets) Y = {y_i}_{i=1}^N, the task became to select a smaller dataset \mathbf Z \subset X and \mathbf Y \subset \mathbf Y, and to compute the Gaussian process on that dataset instead.

Then, Ed Snelson and Zoubin Ghahramani (2006) had a great idea which was to generalize the set of selected points so that they did not have to be part of the dataset. They called these new points ‘pseudo inputs’, since z_i \in \mathcal X, but not necessarily z_i \in \mathbf X. A neat thing about this approach is that the positions of pseudo-inputs could be optimized using gradient-based methods in order to improve the likelihood. This idea stuck: the method was renamed ‘FITC’ (fully independent training conditional) by (Quiñonero-Candela and Rasmussen 2006), and has been the workhorse of sparse Gaussian processes ever since.

The approximate priors

A nice thing about the Quiñonero-Candela paper was that it provided an overview of all the sparse (or pseudo-input) GP methods at the time, and showed that they could all be interpreted as alternative models. If we consider the Gaussian process model as a prior on functions, then many sparse GPs can be considered as alternative priors on functions.

The problem with this is that it conflates the model and the approximation. It’s hard to know whether to attribute performance (or problems) to one or the other. For example, we sometimes see that FITC outperforms standard GP regression. This might sound like a bonus, but it’s actually rather problematic. It causes problems for the user because they are no longer able to criticise the model.

Suppose we fit a model using FITC, and it performs less well than desired. Should we modify the model (switch to a different kernel) or should we modify the approximation (increase the number of inducing points)? Equally, suppose we fit a model using FITC and it performs really well. Is that because we have made an excellent choice of kernel, or because pathologies in the FITC approximation are hiding mis-specifications in the model?

I can give a concrete example of both of these cases. First, in replicating the ‘Generalized FITC’ paper (Naish-Guzman and Holden 2007), Ricardo Andrade (now at UCSF) found issues replicating the results on the paper. The model seemed to overfit. After much investigation, it turned out that stopping the optimization after only 100 iterations (as specified in Naish-Guzman’s article) made the method work. The FITC approximation conflated with early stopping, making it hard to know what was wrong. Second, the FITC approximation gives a heteroscedastic effect (see Ed Snelson’s thesis for discussion). This seems to help on some datasets: but how are we to know if a dataset needs a heteroscedastic model? If I want such a model, I want to specify it, understand it, and to subject it to model comparison and criticism.

There is an accumulation of evidence that the FITC method overfits. Alex Matthews has a tidy case study in his thesis for the regression case. The FITC approximation will give you the real posterior if the inducing points are placed at the data points, but optimizing the positions of the inducing points won’t necessarily help. Quiñonero-Candela’s explanation of FITC as an alternative model (prior) explains why: the inducing points are parameters of a very large model, and optimizing those parameters can lead to overfitting.

The approximate posteriors

In 2009, Michalis Titsias published a paper which proposed a different approach: “Variational Learning of Inducing Variables in Sparse Gaussian Processes” (Titsias 2009). This method doesn’t quite fit into the unifying view proposed by Quiñonero-Candela. The idea is to construct a variational approximation to the posterior process, and learn the pseudo-points alongside the kernel parameters by maximising the evidence lower bound (ELBO). There was quite a bit of prior art on variational inference in Gaussian processes (e.g. Csató and Opper 2002; Seeger 2003): Titsias’ main contribution was to show that the pseudo-input points could be optimized in a rigorous way.

Now we understand that the variational method proposed by Titsias (and other related methods, e.g. Opper and Archambeau 2009) is minimizing the Kullback-Leibler (KL) divergence between the GP approximation and the true posterior GP, where the KL is defined across the whole process (Matthews et al 2016). This gives us a solid foundation on which to build approximations to the posterior of many Gaussian process models, including GP classification, GP state-space models, and more.

The construction is as follows. If the GP prior is:

p( f(.) ) = GP(0, k(., .))

And the likelihood is:

p(y | f(.) ) = \prod_i p(y_i | f(x_i) )

Then we cactive learning to determine which datapoints were the most useful (see e.g. the informative vector machine, Lawrence et al 2003)n give an approximation to the posterior as:

q(f(.) | y) = GP( \mu(.), \sigma(., .))

Where mu and sigma are functions that depend on the pseudo-points and maybe other parameters too. Then, we ‘just’ do variational Bayes. The beauty of this is that it’s totally clear that improving the ELBO will improve the approximation: we’re free to do whatever we want with mu and sigma. Almost. There are two tricky points:

It’s not allowed to write p(f(.)) or q(f(.)). This would be okay if the GP was defined on a finite input domain (so that f was just a multivariate normal vector). Alex Matthews showed how this can be extended to infinite domains (like the reals) in his AISTATS paper (Matthews et al 2016), and with more discussion in his thesis (Matthews 2017). In summary: most of the time it’s not a problem.
To be able to compute the ELBO, we do need to pick particular forms for mu and sigma. The most straight-forward way to do this is

mu(.) = k(., Z) k(Z, Z)^{-1} m

sigma(., .) = k(., .) - k(., Z)[k(Z, Z)^{-1} - k(Z, Z)^{-1}\Sigmak(Z, Z)^{-1}]k(Z, .)

Where m and \Sigma and Z are variational parameters to be maximised. Other constructions are allowed, see e.g. (Hensman et al 2017).

These expressions appear superficially similar to the approximate-prior approach, but the underpinning idea is where most of the value lies. What we’re doing is Variational Bayes with the whole Gaussian process. Some immediate advantages include:

The approximation is nonparametric. This means that behaviour of the predictions away from the data takes the same form as the true posterior (usually increasing predictive variance away from the data).
We can always add more inducing points: since the solution with n pseudo points could always be captured by the one with n+1 pseudo points, the quality of the approximation is monotonically increasing in the number of pseudo points.
The pseudo-inputs \mathbf Z (and the number of them!) are parameters of the approximation, not parameters of the model. We can apply whatever strategy we want to the optimization of \mathbf Z: if the ELBO increases, then the approximation must be better (in the KL sense).
It’s clear what to do at predict time. The whole process is included in the approximation, so we simply have to evaluate the approximate posterior GP to make a prediction. This has been unclear in the past, when some authors have made the distinction of ‘extra approximations’ being required for prediction.

Outlook

Variational Bayes gives a solid foundation for research in approximate Gaussian processes. VB is making a serious comeback at the moment: stochastic variational inference lets us apply VB to large datasets; recognition models let us make efficient representations of approximate posteriors in latent variable models, and normalizing flows let us make the approximating posteriors very flexible. In my view, these technologies are all possible because VB genuinely turns inference into optimization: a better ELBO (evidence lower bound) is always better. I’m looking forward to seeing more GP models being solved by VB as new ideas filter into the GP community. Equally, I’m looking forward to GPs becoming ‘the standard’ building block for modelling functions within models as the variational approximation fits within wider inference schemes.

References

Yoshua Bengio on Quora:

https://www.quora.com/What-are-some-advantages-of-using-Gaussian-Process-Models-vs-Neural-Networks/answer/Yoshua-Bengio?share=1b76fe57

Williams, Christopher KI, and Matthias Seeger. "Using the Nyström method to speed up kernel machines." Proceedings of the 13th International Conference on Neural Information Processing Systems. MIT press, 2000.

Lawrence, Neil, Matthias Seeger, and Ralf Herbrich. "Fast sparse Gaussian process methods: The informative vector machine." Advances in neural information processing systems (2003): 625-632.

Snelson, Edward, and Zoubin Ghahramani. "Sparse Gaussian processes using pseudo-inputs." Advances in neural information processing systems 18 (2006): 1257.

Quiñonero-Candela, Joaquin, and Carl Edward Rasmussen. "A unifying view of sparse approximate Gaussian process regression." Journal of Machine Learning Research 6.Dec (2005): 1939-1959.

Naish-Guzman, Andrew, and Sean B. Holden. "The Generalized FITC Approximation." NIPS. 2007.

Csató, Lehel, and Manfred Opper. "Sparse on-line Gaussian processes." Neural computation 14.3 (2002): 641-668.

Seeger, Matthias. "Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations." (2003).

Csató, L., Fokoué, E., Opper, M., Schottky, B., & Winther, O. (2000). Efficient approaches to gaussian process classification. In Advances in neural information processing systems (pp. 251-257)

Bauer, Matthias, Mark van der Wilk, and Carl Edward Rasmussen. "Understanding probabilistic sparse gaussian process approximations." Advances in Neural Information Processing Systems. 2016.

Matthews, A. G. D. G., et al. "On sparse variational methods and the Kullback-Leibler divergence between stochastic processes." Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics. 2016.

Opper, Manfred, and Cédric Archambeau. "The variational Gaussian approximation revisited." Neural computation 21.3 (2009): 786-792.

Bauer, Matthias, Mark van der Wilk, and Carl Edward Rasmussen. "Understanding probabilistic sparse Gaussian process approximations." NIPS 2016.

Bui, Thang D., Josiah Yan, and Richard E. Turner. "A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation." arXiv preprint arXiv:1605.07066 (2016)