1 Introduction
Gaussian Processes (GPs) are nonparametric models that can be used to address machine learning problems, including regression and classification
[19]. GPs become more expressive as the number of training instances grows and, since they are Bayesian models, they provide a predictive distribution that estimates the uncertainty associated with the predictions made. This uncertainty estimation, or ability to know what is not known, is critical in many practical applications [7]. Nevertheless, GPs suffer from poor scalability: their training cost is $\mathcal{O}(n^3)$ due to the need to compute the inverse of an $n \times n$ covariance matrix, with $n$ the number of training points. Another limitation is that approximate inference is required with non-Gaussian likelihoods [19].

To improve the cost of GPs, sparse approximations can be used [19]. The most popular ones introduce a set of $m \ll n$ inducing points [25, 26]. The inducing points and their associated posterior values completely specify the posterior process at test points. In the method described in [25], the computational gain is obtained by assuming independence among the process values at the training points given the inducing points and their values. This approach can also be seen as using an approximate GP prior [18]. By contrast, in the method of [26] the computational gain is obtained by combining variational inference (VI) with a posterior approximation that has a fixed part and a tunable part. In both methods the computational cost is reduced to $\mathcal{O}(nm^2)$, and the inducing points, treated as hyperparameters of the model, are learned by maximizing an estimate of the marginal likelihood.
Importantly, the VI approach in [26] maximizes a lower bound on the log-marginal likelihood as an indirect way of minimizing the KL divergence between an approximate posterior distribution for the process values associated with the inducing points and the corresponding exact posterior. The advantage of this approach is that the objective is expressed as a sum over the training instances, allowing minibatch training and stochastic optimization techniques to be applied [11]. This reduces the training cost to $\mathcal{O}(m^3)$ per iteration, making GPs scalable to very large datasets.
When using sparse approximations, however, one often observes in practice that after optimization the inducing points are located in those regions of the input space in which the latent function changes [25, 26, 12, 2]. Therefore, the expressive power of the model critically depends on the number of inducing points and on their correct placement across the input space. For example, some problems may require a large number of inducing points, on the order of several hundred, to obtain good prediction results [11, 23, 27]. This makes it difficult to use sparse GPs based on inducing points in those problems.
In the literature there have been some attempts to improve the computational cost of sparse approximations. These approaches follow different strategies, including using different sets of inducing points for the computation of the posterior mean and variance [3]. Other approaches use an orthogonal decomposition of the GP that makes it possible to introduce an extra set of inducing points at a lower cost [23]. Finally, other methods consider a large set of inducing points, but restrict the computations for a particular data point to its nearest neighbors within that set [27].

In this work we are inspired by the approach of [27] and propose a novel method to improve the computational cost of sparse GPs. Our method also tries to produce a set of inducing points (and an associated variational approximation) that is specific to each different input data point, as in [27]. For that, we note that some works in the literature have observed that one can learn mappings from inputs to proposal distributions instead of directly optimizing their parameters [15, 24]. This approach, known as amortized variational inference, is a key contribution of variational autoencoders (VAEs) [15], and has also been explored in the context of GPs to solve other types of problems, such as multi-class classification with input noise [28]. Amortized inference has also been empirically shown to provide useful regularization properties that improve generalization performance [24].
Specifically, here we propose to combine a sparse GP with a neural network architecture that computes the inducing point locations associated with each input point. Moreover, we also employ a neural network to carry out amortized VI, computing the parameters of the variational distribution that approximates the posterior over the values of the outputted inducing points. Critically, this approach makes it possible to drastically reduce the number of inducing points without losing expressive power, since we now have a different set of inducing points associated with each input location. The inducing points are simply given by a mapping from the inputs provided by a neural network. We show in several experiments that the proposed method performs similarly to or better than standard sparse GPs and competitive methods for improving the cost of sparse GPs [27, 23], while its training and prediction times are much shorter.
2 Gaussian processes
A Gaussian Process (GP) is a stochastic process for which any finite set of process values has a joint Gaussian distribution
[19]. GPs are nonlinear, nonparametric approaches for regression and classification. In a learning task, we use a GP as a prior over a latent function. Then, Bayes' rule is used to obtain a posterior for that function given the observed data. Consider a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where each scalar $y_i$ is assumed to be obtained by adding independent and identically distributed Gaussian noise with zero mean and variance $\sigma^2$ to a function $f$ evaluated at $\mathbf{x}_i$, i.e., $y_i = f(\mathbf{x}_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. We specify a prior distribution for $f$ in the form of a GP, which is described by a mean function (often set to zero) and a covariance function $k(\cdot, \cdot)$ such that $\mathrm{Cov}(f(\mathbf{x}_i), f(\mathbf{x}_j)) = k(\mathbf{x}_i, \mathbf{x}_j)$. Covariance functions typically have some adjustable parameters $\boldsymbol{\theta}$. The GP formulation allows computing a posterior or predictive distribution for the potential values of $f$ given $\mathcal{D}$. More precisely, the prediction at a new test point $\mathbf{x}^\star$ is Gaussian with mean and variance given by

$m(\mathbf{x}^\star) = \mathbf{k}_\star^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$,  (1)
$v(\mathbf{x}^\star) = k(\mathbf{x}^\star, \mathbf{x}^\star) - \mathbf{k}_\star^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_\star$,  (2)

where $m(\mathbf{x}^\star)$ and $v(\mathbf{x}^\star)$ are the prediction mean and variance, respectively; $\mathbf{k}_\star$ is a vector with the covariances between $f(\mathbf{x}^\star)$ and each $f(\mathbf{x}_i)$; similarly, $\mathbf{K}$ has the covariances between $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, for $i, j = 1, \ldots, n$; and $\mathbf{I}$ stands for the identity matrix. A popular covariance function is the squared exponential, which assumes that $f$ is smooth [19]. Its parameters, $\boldsymbol{\theta}$ and $\sigma^2$, can simply be found via type-II maximum likelihood estimation, i.e., by maximizing the marginal likelihood $p(\mathbf{y})$ [19]. Note, however, that the computational complexity of this approach is $\mathcal{O}(n^3)$, since it requires the inversion of $\mathbf{K} + \sigma^2 \mathbf{I}$, an $n \times n$ matrix. This makes GPs unsuitable for large datasets.

2.1 Sparse variational Gaussian processes
Sparse Gaussian processes improve the computational cost of GPs. The most popular approaches introduce, in the same input space as the original data, a new set of $m$ points called the inducing points, denoted by $\mathbf{Z} = \{\mathbf{z}_j\}_{j=1}^m$ [25, 26]. Let the corresponding latent function values be $\mathbf{u} = (f(\mathbf{z}_1), \ldots, f(\mathbf{z}_m))^\top$. The inducing points are not restricted to be part of the observed data, and their locations can be learned during training. A GP prior is placed on $\mathbf{u}$. Namely, $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} | \mathbf{0}, \mathbf{K}_{\mathbf{Z}})$, where $\mathbf{K}_{\mathbf{Z}}$ is an $m \times m$ matrix with the covariances associated with each pair of points from $\mathbf{Z}$. The main idea of sparse approximations is that the posterior for $f$ can be approximated in terms of the posterior for $\mathbf{u}$.
In this work we focus on a widely used variational inference (VI) approach to approximate the posterior for $\mathbf{u}$ [26]. Let $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$. In VI, the goal is to find an approximate posterior for $\mathbf{f}$ and $\mathbf{u}$, $q(\mathbf{f}, \mathbf{u})$, that resembles as much as possible the true posterior $p(\mathbf{f}, \mathbf{u} | \mathbf{y})$. Critically, $q$ is constrained to be $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u}) q(\mathbf{u})$, with $p(\mathbf{f} | \mathbf{u})$ fixed and $q(\mathbf{u})$ a tunable multivariate Gaussian. To find $q(\mathbf{u})$, a lower bound of the marginal likelihood is maximized. This bound is obtained through Jensen's inequality, leading, after some simplifications, to the following expression:
$\mathcal{L} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i)}\left[\log p(y_i | f_i)\right] - \mathrm{KL}\left(q(\mathbf{u}) \,\|\, p(\mathbf{u})\right)$,  (3)
where $p(y_i | f_i)$ is the model's likelihood and $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence between probability distributions. In [26], $q(\mathbf{u})$ is optimized in closed form. The resulting expression is then maximized to estimate $\mathbf{Z}$, $\boldsymbol{\theta}$ and $\sigma^2$. This leads to a complexity of $\mathcal{O}(nm^2)$. However, if the variational posterior $q(\mathbf{u})$ is optimized alongside $\mathbf{Z}$, $\boldsymbol{\theta}$ and $\sigma^2$, as proposed in [10], the ELBO can be expressed as a sum over training instances, which allows minibatch training and stochastic optimization techniques. Using stochastic variational inference (SVI) reduces the training cost to $\mathcal{O}(m^3)$ per iteration [10]. Importantly, the first term in (3) is an expectation that has a closed-form solution in the case of Gaussian likelihoods. It needs to be approximated in other cases, e.g., binary classification, either by quadrature or MCMC methods [11]. The second term is the KL divergence between the variational posterior and the prior, which can be computed analytically since they are both Gaussian.

After optimizing (3), one often observes that the inducing points are located in those regions of the input space in which the latent function changes [26, 12, 2]. In consequence, the expressive power of the sparse GP critically depends on the number of inducing points and on their correct placement. In some learning problems, however, a large number of inducing points, on the order of several hundred, is required to obtain good prediction results [11, 23, 27]. This makes using sparse GPs in those problems difficult and expensive. In the next section we describe how to reduce the cost of this method.
3 Input dependent sparse GPs
We develop a novel formulation of sparse GPs which, for every given input, computes the corresponding inducing points to be used for prediction, as well as the associated parameters of the approximate distribution $q(\mathbf{u})$. To achieve this goal we consider a meta-point $\bar{\mathbf{x}}$ that is used to determine the inducing points and the corresponding $q(\mathbf{u})$. Namely, $\mathbf{Z}$ now depends on $\bar{\mathbf{x}}$, i.e., $\mathbf{Z} = \mathbf{Z}(\bar{\mathbf{x}})$. In particular, the inducing points depend nonlinearly, e.g., via a deep neural network, on $\bar{\mathbf{x}}$. The joint distribution of $\mathbf{f}$, $\mathbf{u}$ and $\bar{\mathbf{x}}$ is then given by $p(\mathbf{f}, \mathbf{u} | \bar{\mathbf{x}}) p(\bar{\mathbf{x}})$, for some prior distribution $p(\bar{\mathbf{x}})$ over the meta-point. Following [27], we can consider an implicit distribution $p(\bar{\mathbf{x}})$. That is, its analytical form is unknown, but we can draw samples from it. Later on, we will specify $p(\bar{\mathbf{x}})$.

Note that the marginalized prior $p(\mathbf{f})$ is no longer Gaussian. However, we can show that this formulation does not affect the prior over $\mathbf{f}$. For an arbitrarily selected meta-point $\bar{\mathbf{x}}$ we have that
$p(\mathbf{f}, \mathbf{u} | \bar{\mathbf{x}}) = \mathcal{N}\!\left( \begin{pmatrix} \mathbf{f} \\ \mathbf{u} \end{pmatrix} \,\Big|\, \mathbf{0}, \begin{pmatrix} \mathbf{K} & \mathbf{K}_{\mathbf{f}\mathbf{u}}(\bar{\mathbf{x}}) \\ \mathbf{K}_{\mathbf{f}\mathbf{u}}(\bar{\mathbf{x}})^\top & \mathbf{K}_{\mathbf{Z}(\bar{\mathbf{x}})} \end{pmatrix} \right)$,  (4)
where $\mathbf{K}_{\mathbf{f}\mathbf{u}}(\bar{\mathbf{x}})$ are the cross-covariances between $\mathbf{f}$ and $\mathbf{u}$. Therefore, if $\mathbf{u}$ is marginalized out in (4), the prior for $\mathbf{f}$ is the standard GP prior and does not depend on $\bar{\mathbf{x}}$. Hence, $p(\mathbf{f}) = \int p(\mathbf{f} | \bar{\mathbf{x}}) p(\bar{\mathbf{x}}) \, d\bar{\mathbf{x}} = \mathcal{N}(\mathbf{f} | \mathbf{0}, \mathbf{K})$. This means that $p(\mathbf{f}, \mathbf{u})$ is a mixture of Gaussian densities, where the marginal over $\mathbf{f}$ is the same for every component of the mixture.
Note, however, that in the standard sparse GP the inducing points also have an impact on the variational approximation via the fixed conditional distribution $p(\mathbf{f} | \mathbf{u})$ [26]. Therefore, we also have to incorporate the dependence on $\bar{\mathbf{x}}$ in the approximation $q$. This is done in the next section.
3.1 Lower bound on the log-marginal likelihood
We follow [27] to derive a lower bound on the log-marginal likelihood of the extended model described above. Consider a posterior approximation of the form $q(\mathbf{f}, \mathbf{u}, \bar{\mathbf{x}}) = p(\mathbf{f} | \mathbf{u}, \bar{\mathbf{x}}) \, q(\mathbf{u} | \bar{\mathbf{x}}) \, p(\bar{\mathbf{x}})$, where only $q(\mathbf{u} | \bar{\mathbf{x}})$ can be adjusted and the other factors are fixed. Using Jensen's inequality, we obtain, after some simplifications, the lower bound
$\mathcal{L} = \mathbb{E}_{p(\bar{\mathbf{x}})}\!\left[ \sum_{i=1}^{n} \mathbb{E}_{q(f_i | \bar{\mathbf{x}})}\left[\log p(y_i | f_i)\right] - \mathrm{KL}\left(q(\mathbf{u} | \bar{\mathbf{x}}) \,\|\, p(\mathbf{u} | \bar{\mathbf{x}})\right) \right]$.  (5)
Now, since $p(\bar{\mathbf{x}})$ is an implicit distribution, we can draw samples from it and approximate the expectation with respect to $\bar{\mathbf{x}}$ by Monte Carlo. Thus, for a meta-point sample $\bar{\mathbf{x}}_s$ from $p(\bar{\mathbf{x}})$, the lower bound is approximated as
$\hat{\mathcal{L}} = \sum_{i=1}^{n} \mathbb{E}_{q(f_i | \bar{\mathbf{x}}_s)}\left[\log p(y_i | f_i)\right] - \mathrm{KL}\left(q(\mathbf{u} | \bar{\mathbf{x}}_s) \,\|\, p(\mathbf{u} | \bar{\mathbf{x}}_s)\right)$.  (6)
Note that we can evaluate (6) and its gradients to maximize the original objective in (5) using stochastic optimization techniques. This is valid for any implicit distribution $p(\bar{\mathbf{x}})$. Consider now that we use minibatch-based training for optimization, and that we set $\bar{\mathbf{x}}$ to each of the inputs in the current minibatch. In this case, the value of $\bar{\mathbf{x}}$ remains random, as it depends on the points that are selected in the random minibatch. This results in a method that computes a different set of inducing points associated with each input location.
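Since (5) is an expectation over $p(\bar{\mathbf{x}})$, the single-sample estimate (6) is unbiased, and averaging such estimates recovers the exact value. A toy NumPy check of this generic fact (the integrand below is an arbitrary stand-in, not the actual ELBO):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x_bar):
    # Stand-in for the bracketed term of (5) evaluated at one meta-point.
    return x_bar ** 2

# Exact expectation of g under p(x_bar) = N(0, 1) is 1.
samples = rng.standard_normal(200_000)
single_sample_estimates = g(samples)          # one estimate per draw, as in (6)
running_mean = single_sample_estimates.mean()
```

Each individual estimate is noisy, but stochastic optimizers only require the gradient to be correct on average, which is what makes optimizing (6) in place of (5) valid.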
3.2 Amortized variational inference and deep neural networks
Maximizing the lower bound finds the optimal approximate distribution $q(\mathbf{u} | \bar{\mathbf{x}})$. A problem, however, is that we have a potentially large number of parameters to fit, one set for each $\bar{\mathbf{x}}$. In particular, if we set $q(\mathbf{u} | \bar{\mathbf{x}})$ to be Gaussian, we will have to infer a different mean and covariance matrix for each different $\bar{\mathbf{x}}$. This is expected to be memory inefficient and to make optimization difficult.
To reduce the number of parameters of our method we use amortized variational inference and specify a function that can generate these parameters for each $\bar{\mathbf{x}}$ [24]. More precisely, we set the mean and covariance matrix of $q(\mathbf{u} | \bar{\mathbf{x}})$ to be $\mathbf{m}(\bar{\mathbf{x}})$ and $\mathbf{S}(\bar{\mathbf{x}})$, for some nonlinear functions $\mathbf{m}(\cdot)$ and $\mathbf{S}(\cdot)$.
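To make the amortization concrete, the following NumPy sketch shows one possible shape for such a function: a single-hidden-layer network mapping an input to the flattened inducing point locations, the mean, and a lower-triangular Cholesky factor whose diagonal is made positive with a softplus. The layer sizes, initialization, and softplus choice are illustrative assumptions, not the exact architecture of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_amortization_net(d, m, hidden=25):
    """One-hidden-layer MLP mapping x in R^d to the inducing point
    locations Z(x) (m x d), the mean mu(x) (m,) and the Cholesky
    factor L(x) (m x m) of the covariance of q(u | x)."""
    out_dim = m * d + m + m * m
    W1 = rng.standard_normal((d, hidden)) * 0.1
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    b2 = np.zeros(out_dim)

    def forward(x):
        h = np.tanh(x @ W1 + b1)
        out = h @ W2 + b2
        Z = out[: m * d].reshape(m, d)
        mu = out[m * d : m * d + m]
        L = np.tril(out[m * d + m :].reshape(m, m))
        diag = np.diag_indices(m)
        L[diag] = np.log1p(np.exp(L[diag]))  # softplus keeps the diagonal positive
        return Z, mu, L

    return forward

net = make_amortization_net(d=2, m=3)
Z, mu, L = net(np.array([0.5, -1.0]))
```

All variational parameters are thus shared through the network weights, so their number no longer grows with the number of distinct inputs.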
Deep neural networks (DNNs) are very flexible models that can describe complicated nonlinear functions. In these models, the inputs go through several layers of nonlinear transformations. We use these models to compute the nonlinearities that generate, from $\bar{\mathbf{x}}$, the inducing points $\mathbf{Z}(\bar{\mathbf{x}})$ and the means and covariances of $q(\mathbf{u} | \bar{\mathbf{x}})$, i.e., $\mathbf{m}(\bar{\mathbf{x}})$ and $\mathbf{S}(\bar{\mathbf{x}})$. The architecture employed is displayed in Figure 1. At the output of the DNN we obtain $\mathbf{Z}(\bar{\mathbf{x}})$, a mean vector $\mathbf{m}(\bar{\mathbf{x}})$ and the Cholesky factor of the covariance matrix $\mathbf{S}(\bar{\mathbf{x}})$. The maximization of the lower bound in (5) when using DNNs for the nonlinearities is shown in Algorithm 1. The required expectations are computed in closed form in regression. In binary classification, we use 1-dimensional quadrature, as in [12].

3.3 Predictions and computational cost
At test time, however, the inputs of interest are not randomly chosen. In that case, we simply set $p(\bar{\mathbf{x}})$ to be a deterministic distribution that places all its mass on the candidate test point $\mathbf{x}^\star$. The DNN is used to obtain the associated relevant information. Namely, $\mathbf{Z}(\mathbf{x}^\star)$ and the parameters of $q(\mathbf{u} | \mathbf{x}^\star)$, i.e., $\mathbf{m}(\mathbf{x}^\star)$ and $\mathbf{S}(\mathbf{x}^\star)$. The predictive distribution for $f(\mathbf{x}^\star)$ is:
$q(f(\mathbf{x}^\star)) = \int p(f(\mathbf{x}^\star) | \mathbf{u}, \mathbf{x}^\star) \, q(\mathbf{u} | \mathbf{x}^\star) \, d\mathbf{u}$.  (7)
Given this distribution for $f(\mathbf{x}^\star)$, the probability distribution for $y^\star$ can be computed in closed form in the case of regression problems, and with 1-dimensional quadrature in the case of binary classification.

The cost of our method is smaller than the cost of a standard sparse GP if a smaller number of inducing points $m$ is used. The cost of evaluating the DNN is linear in its number of weights, which is determined by the number of layers, the number of hidden units, the input dimension and the output dimension. The cost of the sparse GP computations is $\mathcal{O}(bm^2 + m^3)$, with $b$ the minibatch size. The cost of our method is therefore the sum of these two terms. Since in our method the inducing points are input dependent, we expect to obtain good prediction results even for values of $m$ that are fairly small.
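Given the network outputs for a test point, the mean and variance of (7) follow the standard sparse GP predictive formulas, now evaluated with the input-dependent quantities. The sketch below is our own illustrative code (with a squared exponential kernel of unit amplitude and a small jitter, both assumptions), computing the prediction for a single test point:

```python
import numpy as np

def sq_exp(A, B, lengthscale=1.0):
    """Squared exponential covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def idsgp_predict(x_star, Z, mu, L):
    """Predictive mean/variance of f(x*) under q(u | x*) = N(mu, L L^T).

    Z, mu, L are the DNN outputs for this particular test point x_star.
    """
    m = len(Z)
    K_Z = sq_exp(Z, Z) + 1e-8 * np.eye(m)       # jitter for numerical stability
    k_star = sq_exp(Z, x_star[None, :])[:, 0]   # covariances between u and f(x*)
    A = np.linalg.solve(K_Z, k_star)            # K_Z^{-1} k_*
    S = L @ L.T
    mean = A @ mu
    var = 1.0 - A @ k_star + A @ S @ A          # k(x*,x*) = 1 for this kernel
    return mean, var

# If the single inducing point coincides with x*, the prediction
# essentially reproduces q(u | x*) itself.
Z = np.array([[0.5]]); mu = np.array([2.0]); L = np.array([[0.3]])
mean, var = idsgp_predict(np.array([0.5]), Z, mu, L)
```

Because each test point uses only its own small set of $m$ inducing points, this computation involves an $m \times m$ solve rather than one with a large, global inducing set.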
4 Related work
Early works on sparse GPs simply chose a subset of the training data for inference based on an information criterion [4, 16, 22, 9]. This approach is limited in practice, and more advanced methods in which the inducing points need not be equal to the training points are believed to be superior. In the literature there are several works analyzing and studying sparse GP approximations based on inducing points, including [18, 25, 17, 26, 2, 13]. We focus here on a variational approach [26] which allows for stochastic optimization and minibatch training [10, 12]. This enables learning on very large datasets with a cost of $\mathcal{O}(m^3)$ per iteration, with $m$ the number of inducing points.
In general, however, a large number of inducing variables is desirable to obtain a good approximation to the posterior distribution. For some problems, even several hundred inducing points may be needed to obtain good prediction results [11, 23, 27]. There is hence a need to improve the computational cost of sparse GP approximations without losing expressive power. One work addressing this task is that of [3], which proposes to decouple the inference of the posterior mean and variance, allowing a different number of inducing points for each. Importantly, the computation of the mean achieves a linear complexity, which allows more expressive posterior means at a lower cost. A disadvantage is that such an approach suffers from optimization difficulties. An alternative decoupled parameterization that adopts an orthogonal basis for the mean is proposed in [20]. Such a method can be considered a particular case of [23], which introduces a new interpretation of sparse variational approximations for GPs using inducing points. For this, the GP is decomposed as the sum of two independent processes. This leads to tighter lower bounds on the marginal likelihood and to new inference algorithms that allow two different sets of inducing points to be considered. This enables including more inducing points at a linear, instead of cubic, cost.
Our work is closest to [27], which also describes a mechanism to consider input-dependent inducing points in the context of sparse GPs. However, the difference is significant. In particular, in [27] a very large set of inducing points is considered initially. Then, for each input point, a subset of these inducing points is selected by finding the inducing points nearest to the current data instance. This approach significantly reduces the cost of the standard sparse GP described in [26]. However, it suffers from having to find the nearest neighbors for each point in a minibatch, which is expensive. Therefore, the final cost is higher than one would initially expect. Our method is expected to be better because of the extra flexibility provided by the nonlinear relation between the input and the inducing points computed by the DNN. Furthermore, the DNN can take advantage of GPU acceleration.
A different way of improving the computational cost of GPs is described in [29, 6, 8]. The approach consists of placing the inducing points on a grid structure, which allows fast computations that exploit this structure. One can easily consider values for $m$ that are even larger than $n$. However, to obtain such benefits the inducing points need to be fixed due to the structure constraints. This may be detrimental in problems with a high input dimension.
Finally, the use of amortized variational inference [24] has also been explored in the context of GPs in [28]. There, input noise is considered in a multi-class learning problem, and the variational parameters of the posterior approximation for the noiseless inputs are amortized using a DNN that receives the corresponding observed data. Amortized VI is shown to improve the generalization performance of the resulting model.
5 Experiments
We evaluate the performance of the proposed method, which we refer to as Input Dependent Sparse GP (IDSGP). We consider both regression and binary classification with a probit likelihood. In the latter case, we approximate the expectation in the lower bound using 1-dimensional quadrature, as in [12]. An implementation of the proposed method in TensorFlow 2.0 [1] is provided in the supplementary material. In the experiments we compare results with the standard variational sparse GP [26], which we refer to as VSGP. We also compare results with two of the methods described in Section 4. Namely, the sparse within sparse GP (SWSGP) described in [27], and the sparse GP based on an orthogonal decomposition that allows two different sets of inducing points [23]. We refer to this last method as SOLVE. All methods use a Matérn covariance function [19].

5.1 Toy problems
A first set of experiments illustrates the resulting predictive mean and standard deviation of each method on a 1-dimensional regression problem [25]. Each method is compared with a full GP. Figure 2 shows the results obtained, including the learned locations of the inducing points. In the case of IDSGP we show the locations of the inducing points for the point represented with a star. The number of inducing points of each method is indicated in the figure's caption; IDSGP uses a smaller number of inducing points than the other methods. The figure shows that, in regions with observed data, IDSGP outputs a mean and a standard deviation that are closer to those of the full GP.

5.2 Experiments on UCI datasets
A second set of experiments considers several regression and binary classification datasets extracted from the UCI repository [5]. In the regression setting, the DNN architecture used for IDSGP has one hidden layer. The number of inducing points of IDSGP is set to a small value, while VSGP, SOLVE and SWSGP use larger numbers of inducing points (SWSGP additionally requires setting the number of neighbors). All the methods are trained using ADAM [14] with fixed minibatch size and learning rate. In the classification setting we use the same setup, but with a different number of inducing points for IDSGP. All methods are trained on a Tesla P100 GPU with 16 GB of memory. On each dataset we use part of the data for training and the rest for testing, and, since the datasets are already quite big, we report results across a small number of splits of the data.
The average negative test log-likelihood of each method on each dataset is displayed in Table 1 for the regression datasets and in Table 2 for the classification datasets, respectively. The average rank of each method is also displayed in the last row of each table. The RMSE and prediction accuracy results are similar to those displayed here; they can be found in the supplementary material. Each table also shows the number of instances and dimensions of each dataset. We observe that on the regression datasets the proposed method, IDSGP, obtains the best results on most of the datasets. IDSGP also obtains the best average rank (closest to always performing best on each train / test data split). This is remarkable given that IDSGP uses a much smaller number of inducing points than the other methods. In classification, however, all the methods seem to perform similarly and the differences between them are smaller. Again, IDSGP uses here a smaller number of inducing points; increasing this number does not seem to improve the results.
Dataset  n  d  VSGP  SOLVE  SWSGP  IDSGP
Kin40k  32,000  8  0.047 (0.003)  0.415 (0.006)  0.110 (0.007)  1.461 (0.019) 
Protein  36,584  9  2.848 (0.002)  2.818 (0.003)  2.835 (0.002)  2.775 (0.007) 
KeggDirected  42,730  19  1.955 (0.013)  1.756 (0.073)  2.256 (0.012)  2.410 (0.012) 
KEGGU  51,686  26  2.344 (0.012)  2.531 (0.015)  2.396 (0.006)  2.908 (0.042) 
3dRoad  347,899  3  3.691 (0.006)  3.726 (0.010)  3.879 (0.026)  3.399 (0.009) 
Song  412,276  90  3.613 (0.003)  3.608 (0.002)  3.618 (0.004)  3.637 (0.002) 
Buzz  466,600  77  6.272 (0.012)  6.297 (0.009)  6.137 (0.008)  6.317 (0.055) 
HouseElectric  1,639,424  6  1.737 (0.006)  1.743 (0.005)  1.711 (0.010)  1.774 (0.004) 
Avg. Ranks  3.125 (0.125)  2.475 (0.156)  2.850 (0.150)  1.550 (0.172) 
Avg. neg. test log-likelihood values for the UCI regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted in bold face.
Dataset  n  d  VSGP  SOLVE  SWSGP  IDSGP
MagicGamma  15,216  10  0.308 (0.004)  0.314 (0.005)  0.371 (0.005)  0.311 (0.002) 
DefaultOrCredit  24,000  30  0.000 (0.000)  0.000 (0.000)  0.000 (0.000)  0.000 (0.000) 
NOMAO  27,572  174  0.113 (0.004)  0.103 (0.004)  0.134 (0.004)  0.121 (0.004) 
BankMarketing  36,169  51  0.206 (0.001)  0.199 (0.001)  0.304 (0.021)  0.209 (0.002) 
Miniboone  104,051  50  0.151 (0.001)  0.142 (0.001)  0.180 (0.007)  0.153 (0.001) 
Skin  196,046  3  0.005 (0.000)  0.005 (0.000)  0.006 (0.001)  0.003 (0.000) 
Crop  260,667  174  0.003 (0.000)  0.003 (0.000)  0.002 (0.000)  0.003 (0.000) 
HTSensor  743,193  11  0.003 (0.001)  0.001 (0.000)  0.030 (0.009)  0.005 (0.001) 
Avg. Ranks  2.425 (0.143)  1.775 (0.158)  3.175 (0.182)  2.625 (0.155)
Avg. neg. test log-likelihood values for the UCI classification datasets. The numbers in parentheses are standard errors.
In these experiments we also measure the average training time per epoch for each method. The results corresponding to the UCI regression datasets are displayed in Table 5. The results for the UCI classification datasets can be found in the supplementary material; they look very similar to the ones reported here. We observe that the fastest method in terms of training time is the proposed approach, IDSGP, even though the speed-up obtained is partially impaired by the overhead of having to compute the output of the DNN and update its parameters. IDSGP also results in faster prediction times than VSGP, SOLVE or SWSGP. See the supplementary material for further details.

Method  Kin40k  Protein  KeggDirected  KEGGU  3dRoad  Song  Buzz  HouseElectric

VSGP  591.7 (0.58)  737.2 (1.16)  932.7 (2.56)  1128.1 (3.78)  7880.9 (66.79)  9777.7 (42.84)  9901.0 (146.07)  32784.2 (190.18) 
SOLVE  1739.3 (0.45)  2015.9 (0.66)  2357.3 (1.70)  2909.1 (1.19)  19567.1 (10.34)  23196.6 (98.35)  25769.5 (20.12)  92214.9 (452.18) 
SWSGP  875.7 (0.68)  1023.5 (0.35)  1220.6 (1.89)  1458.0 (5.57)  10203.4 (12.03)  12241.7 (62.01)  13371.5 (12.34)  46163.3 (427.23) 
IDSGP  190.3 (0.75)  371.5 (1.25)  533.0 (1.73)  693.7 (5.77)  4070.1 (201.09)  4296.5 (25.03)  3640.4 (33.36)  16352.2 (90.15) 
5.3 Large scale datasets
A last set of experiments considers two very large datasets. The first is the Airlines Delay binary classification dataset, as described in [13]. The second is the Yellow Taxi regression dataset, as described in [21]. In each dataset we use a test set of instances chosen at random. The DNN architecture for IDSGP has 2 hidden layers with 25 units each, and its number of inducing points is kept small. In the other methods, we use the same number of inducing points as in the previous section. The minibatch size and the ADAM learning rate are fixed, and training is performed on the same GPU as in the previous section.
The average negative test log-likelihood of each method on the test set is displayed in Figure 4 for each dataset. We report performance in terms of the training time, on a logarithmic scale. The results corresponding to the RMSE are very similar to the ones displayed here and can be found in the supplementary material. We observe that the proposed method, IDSGP, performs best on each dataset. In particular, it is able to obtain better performance in a smaller computational time. We believe this is a consequence of using a smaller number of inducing points, and also of the extra flexibility that the DNN provides for specifying the locations of the inducing points.
6 Conclusions
Gaussian processes (GPs) are flexible models for regression and classification. However, they have a cost of $\mathcal{O}(n^3)$, with $n$ the number of training points. Sparse approximations based on inducing points reduce this cost to $\mathcal{O}(nm^2)$, with $m \ll n$ the number of inducing points. A problem, however, is that in some situations a large number of inducing points has to be used in practice, since they determine the flexibility of the resulting approximation. There is hence a need to reduce their computational cost.
We have proposed here the input dependent sparse GP (IDSGP), a method that can improve the training time and the flexibility of sparse GP approximations. IDSGP uses a deep neural network (DNN) to output specific inducing points for each point at which the predictive distribution of the GP needs to be computed. The DNN also outputs the parameters of the corresponding variational approximation over the inducing values associated with the inducing points. IDSGP can be obtained under a formulation that considers an implicit distribution for the input instance fed to the DNN. Importantly, such a formulation is shown to keep intact the GP prior on the latent function values associated with the training points.
The extra flexibility provided by the DNN makes it possible to significantly reduce the number of inducing points used in IDSGP. Such a model provides similar or better results than other sparse GP approximations from the literature at a smaller computational cost. IDSGP has been evaluated on several regression and binary classification problems from the UCI repository. The results obtained show that it improves the quality of the predictive distribution and reduces the computational cost of sparse GP approximations. Better results are obtained most of the time in regression problems. In classification problems, however, the performances obtained are similar to those of the state of the art, although the training and prediction times are always shorter. The scalability of IDSGP is also illustrated on massive datasets for regression and binary classification with up to billions of points. There, IDSGP also obtains better results than alternative sparse GP approximations at a smaller training cost.
The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, grant PID2019106827GBI00 / AEI / 10.13039/501100011033.
References
 [1] (2015) TensorFlow: largescale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §5.
 [2] (2016) Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pp. 1533–1541. Cited by: §1, §2.1, §4.
 [3] (2017) Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pp. 5184–5194. Cited by: §1, §4.
 [4] (2002) Sparse online Gaussian processes. Neural Computation 14, pp. 641–668. Cited by: §4.
 [5] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §5.2.

 [6] (2018) Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pp. 1417–1426. Cited by: §4.
 [7] (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §1.
 [8] (2018) Product kernel interpolation for scalable Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1407–1416. Cited by: §4.
 [9] (2012) Predictive active set selection methods for Gaussian processes. Neurocomputing 80, pp. 10–18. Cited by: §4.
 [10] (2013) Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pp. 282–290. Cited by: §2.1, §4, §5.1.
 [11] (2015) MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems 28, pp. 1648–1656. Cited by: §1, §1, §2.1, §2.1, §4.
 [12] (2015) Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, pp. 351–360. Cited by: §1, §2.1, §3.2, §4, §5.
 [13] (2016) Scalable Gaussian process classification via expectation propagation. In Artificial Intelligence and Statistics, pp. 168–176. Cited by: §4, §5.3.
 [14] (2015) ADAM: a method for stochastic optimization. In International Conference on Learning Representations, pp. 1–15. Cited by: §5.2.
 [15] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations, Cited by: §1.
 [16] (2003) Fast sparse Gaussian process methods: the informative vector machine. In Neural Information Processing Systems, pp. 609–616. Cited by: §4.
 [17] (2007) The generalized FITC approximation. Advances in Neural Information Processing Systems 20, pp. 1057–1064. Cited by: §4.
 [18] (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, pp. 1939–1959. Cited by: §1, §4.
 [19] (2006) Gaussian processes for machine learning (adaptive computation and machine learning). The MIT Press. Cited by: §1, §1, §2, §5.
 [20] (2018) Orthogonally decoupled variational Gaussian processes. In Advances in Neural Information Processing Systems, Vol. 31, pp. 8725–8734. Cited by: §4.
 [21] (2017) Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pp. 4588–4599. Cited by: §5.3.
 [22] (2003) Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pp. 254–261. Cited by: §4.
 [23] (2020) Sparse orthogonal variational inference for Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1932–1942. Cited by: §1, §1, §1, §2.1, §4, §5.
 [24] (2018) Amortized inference regularization. In Advances in Neural Information Processing Systems, pp. 4393–4402. Cited by: §1, §3.2, §4.
 [25] (2006) Sparse Gaussian processes using pseudoinputs. In Advances in Neural Information Processing Systems, pp. 1257–1264. Cited by: §1, §1, §2.1, §4, §5.1.
 [26] (2009) Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §3, §4, §4, §5.
 [27] (2020) Sparse within sparse Gaussian processes using neighbor information. Note: 2011.05041 arXiv stat.ML Cited by: §1, §1, §1, §1, §2.1, §3.1, §3, §4, §4, §5.
 [28] (2021) Multiclass gaussian process classification with noisy inputs. Journal of Machine Learning Research 22, pp. 1–52. Cited by: §1, §4.
 [29] (2015) Kernel interpolation for scalable structured Gaussian processes (kissgp). In International Conference on International Conference on Machine Learning, pp. 1775–1784. Cited by: §4.
Appendix A Potential negative societal impacts
Because this paper focuses on the development of a new methodology, we believe any negative societal impacts would be indirect, arising through the particular applications in which the proposed method is used. Since one of the main advantages of GPs is that they provide uncertainty estimates associated with the predictions made, we believe the potential harm of these models could arise in applications where this uncertainty estimation is critical, for example, AI systems whose decisions can influence people's lives, such as autonomous vehicles or automatic medical diagnosis tools.
Appendix B Extra experimental results
In this section we include some extra results that did not fit in the main manuscript, namely, the test RMSE and prediction times for the UCI regression datasets, and the test accuracy, training times and prediction times for the UCI classification datasets. In both cases the setup is the same as described in Section 5, and the results are similar to those obtained in terms of the negative test log-likelihood and training times in that section. Finally, we include plots analogous to those in Section 5.3, but in terms of the test RMSE for the Yellow Taxi dataset and the test classification error for the Airline Delays dataset.
B.1 UCI regression datasets
Test RMSE (standard errors in parentheses) for each method on the UCI regression datasets; lower is better.

| Dataset | N | d | VSGP | SOLVE | SWSGP | IDSGP |
| --- | --- | --- | --- | --- | --- | --- |
| Kin40k | 32,000 | 8 | 0.198 (0.002) | 0.157 (0.001) | 0.215 (0.002) | 0.050 (0.002) |
| Protein | 36,584 | 9 | 4.161 (0.011) | 4.062 (0.011) | 4.133 (0.008) | 3.756 (0.019) |
| KeggDirected | 42,730 | 19 | 0.032 (0.001) | 0.079 (0.032) | 0.024 (0.000) | 0.022 (0.001) |
| KEGGU | 51,686 | 26 | 0.024 (0.000) | 0.020 (0.000) | 0.022 (0.000) | 0.014 (0.000) |
| 3dRoad | 347,899 | 3 | 9.641 (0.063) | 10.020 (0.095) | 11.726 (0.327) | 7.250 (0.069) |
| Song | 412,276 | 90 | 8.966 (0.022) | 8.925 (0.020) | 9.013 (0.029) | 9.068 (0.011) |
| Buzz | 466,600 | 77 | 175.076 (15.021) | 173.352 (14.957) | 160.744 (13.467) | 166.784 (18.040) |
| HouseElectric | 1,639,424 | 6 | 0.035 (0.000) | 0.034 (0.000) | 0.036 (0.001) | 0.032 (0.000) |
| Avg. Ranks | | | 3.075 (0.126) | 2.400 (0.138) | 3.025 (0.170) | 1.500 (0.151) |
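The "Avg. Ranks" row summarizes relative performance: per dataset, the four methods are ranked by test RMSE (1 is best) and the ranks are averaged. A minimal sketch of this computation, using two rows of the table above (an illustrative reconstruction, not the authors' script; the `ranks` helper is ours):

```python
METHODS = ["VSGP", "SOLVE", "SWSGP", "IDSGP"]

# Two example rows of test RMSE taken from the table above.
rmse = [
    [0.198, 0.157, 0.215, 0.050],  # Kin40k
    [4.161, 4.062, 4.133, 3.756],  # Protein
]

def ranks(row):
    """Rank of each entry within a row, 1 = smallest (best)."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    r = [0] * len(row)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Average each method's rank over the datasets.
n = len(rmse)
avg = [sum(ranks(row)[j] for row in rmse) / n for j in range(len(METHODS))]
print(dict(zip(METHODS, avg)))
# → {'VSGP': 3.5, 'SOLVE': 2.0, 'SWSGP': 3.5, 'IDSGP': 1.0}
```

In the paper the ranks are additionally averaged over the repeated data splits, which is where the standard errors in the "Avg. Ranks" row come from.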
Prediction times (standard errors in parentheses) for each method on the UCI regression datasets.

| Method | Kin40k | Protein | KeggDirected | KEGGU | 3dRoad | Song | Buzz | HouseElectric |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VSGP | 0.9 (0.00) | 1.1 (0.00) | 1.4 (0.01) | 1.7 (0.01) | 11.6 (0.07) | 14.5 (0.08) | 18.0 (0.08) | 48.3 (0.12) |
| SOLVE | 2.4 (0.00) | 2.8 (0.00) | 3.2 (0.00) | 4.0 (0.00) | 27.0 (0.04) | 31.8 (0.19) | 37.0 (0.03) | 127.7 (0.43) |
| SWSGP | 1.3 (0.00) | 1.5 (0.00) | 1.8 (0.01) | 2.1 (0.01) | 14.8 (0.03) | 17.6 (0.02) | 21.3 (0.10) | 66.2 (0.88) |
| IDSGP | 0.3 (0.00) | 0.5 (0.00) | 0.7 (0.00) | 1.0 (0.02) | 5.7 (0.26) | 5.6 (0.05) | 8.1 (0.08) | 22.1 (0.14) |
B.2 UCI classification datasets
Test accuracy (standard errors in parentheses) for each method on the UCI classification datasets; higher is better.

| Dataset | N | d | VSGP | SOLVE | SWSGP | IDSGP |
| --- | --- | --- | --- | --- | --- | --- |
| MagicGamma | 15,216 | 10 | 0.876 (0.001) | 0.877 (0.002) | 0.867 (0.002) | 0.877 (0.002) |
| DefaultOrCredit | 24,000 | 30 | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000) | 1.000 (0.000) |
| NOMAO | 27,572 | 174 | 0.956 (0.002) | 0.960 (0.001) | 0.961 (0.001) | 0.955 (0.001) |
| BankMarketing | 36,169 | 51 | 0.906 (0.001) | 0.907 (0.001) | 0.897 (0.001) | 0.905 (0.001) |
| Miniboone | 104,051 | 50 | 0.941 (0.001) | 0.945 (0.001) | 0.938 (0.000) | 0.937 (0.001) |
| Skin | 196,046 | 3 | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) |
| Crop | 260,667 | 174 | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) |
| HTSensor | 743,193 | 11 | 0.999 (0.000) | 1.000 (0.000) | 0.989 (0.003) | 0.999 (0.000) |
| Avg. Ranks | | | 2.475 (0.127) | 1.975 (0.152) | 2.900 (0.187) | 2.650 (0.168) |
Training times (standard errors in parentheses) for each method on the UCI classification datasets.

| Method | Magic | DefaultOrCredit | NOMAO | BankMarket | Miniboone | Skin | Crop | HTSensor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VSGP | 3105 (459) | 4759 (516) | 4445 (549) | 6231 (862) | 18447 (1279) | 37835 (7065) | 49962 (9292) | 115463 (17511) |
| SOLVE | 5154 (1061) | 7554 (1039) | 6718 (1028) | 8949 (1635) | 37022 (7902) | 64606 (13314) | 88819 (18864) | 168709 (21194) |
| SWSGP | 1547 (145) | 2354 (182) | 2728 (188) | 3682 (351) | 10040 (347) | 20283 (2796) | 21770 (3038) | 67687 (5880) |
| IDSGP | 1143 (100) | 1293 (90) | 2026 (94) | 2987 (354) | 7654 (134) | 15700 (1918) | 21378 (2561) | 53895 (5652) |
Prediction times (standard errors in parentheses) for each method on the UCI classification datasets.

| Method | MagicGamma | DefaultOrCredit | NOMAO | BankMarketing | Miniboone | Skin | Crop | HTSensor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VSGP | 3.6 (0.56) | 4.1 (0.59) | 4.4 (0.74) | 9.5 (4.39) | 17.9 (1.84) | 49.7 (11.15) | 58.7 (12.82) | 139.7 (41.67) |
| SOLVE | 4.0 (0.85) | 5.4 (0.73) | 3.4 (0.76) | 4.9 (1.24) | 49.2 (17.83) | 48.8 (16.73) | 86.8 (36.26) | 93.5 (15.73) |
| SWSGP | 3.0 (0.43) | 3.9 (0.38) | 4.3 (0.51) | 5.4 (0.72) | 16.4 (1.82) | 36.0 (8.00) | 33.9 (6.25) | 88.4 (9.16) |
| IDSGP | 2.5 (0.24) | 2.5 (0.21) | 3.5 (0.39) | 4.8 (0.54) | 13.9 (0.75) | 26.1 (4.92) | 37.5 (4.96) | 83.4 (8.23) |