Abstract: Learning the kernel parameters for Gaussian processes is often the computational bottleneck in applications such as online learning, Bayesian optimization, or active learning. Amortizing parameter inference over different datasets is a promising approach to dramatically speed up training time. However, existing methods restrict the amortized inference procedure to a fixed kernel structure. The amortization network must be redesigned manually and trained again in case a different kernel is employed, which leads to a large overhead in design time and training time. We propose amortizing kernel parameter inference over a complete kernel-structure-family rather than a fixed kernel structure. We do that via defining an amortization network over pairs of datasets and kernel structures. This enables fast kernel inference for each element in the kernel family without retraining the amortization network. As a by-product, our amortization network is able to do fast ensembling over kernel structures. In our experiments, we show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets.
What problem does this paper attempt to address?
The paper addresses **the computational bottleneck of learning kernel parameters in Gaussian processes (GPs)**. In applications such as online learning, Bayesian optimization, or active learning, kernel parameters have to be learned repeatedly, and this step often dominates the runtime. Existing methods learn the parameters by maximizing the marginal likelihood or the evidence lower bound (ELBO), which often requires hundreds of optimization steps and therefore leads to long training times.
To solve this problem, the paper proposes **amortized inference over an entire family of kernel structures**, rather than over a single fixed kernel structure. Specific contributions include:
1. **An amortization network**: The network is defined on the joint space of datasets and kernel structures and explicitly incorporates the invariances and equivariances of that space.
2. **Experimental validation**: The paper demonstrates the effectiveness of amortized inference on multiple simulated and real-world datasets and kernel structures.
3. **Fast ensembling over kernel structures**: The paper demonstrates the generality of the method by defining a fast ensembling method over different kernel structures.
### Specific problem description
#### Limitations of traditional methods
- **Fixed kernel structure**: Existing methods such as Liu et al. [2020b] can only perform amortized inference on a fixed kernel structure. If different kernel structures are to be used, the network needs to be redesigned and retrained, which will lead to a large amount of design time and training time overhead.
- **Computational bottleneck**: Traditional kernel parameter learning methods (such as marginal likelihood maximization) require a large number of optimization steps, and the computational cost is already very high for medium-sized datasets.
#### Advantages of the new method
- **Joint-space amortized inference**: Because the network amortizes inference over the joint space of datasets and kernel structures, it does not have to be redesigned and retrained every time the kernel structure changes.
- **Fast inference**: It infers kernel parameters quickly for every element of the kernel family without retraining.
- **Generality**: It supports fast ensembling over different kernel structures, which broadens its applicability.
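
To make the joint-space idea concrete, the following is a minimal, untrained sketch of the interface only: one function that takes a dataset and a kernel structure and returns positive parameter estimates in a single forward pass. Everything in it is an assumption made for illustration and is not the paper's architecture: the name `g_psi`, the base-kernel vocabulary `BASE_KERNELS`, the random weights, the mean-pooling encoder, and the representation of a kernel structure as a flat list of base-kernel names (which ignores how the kernels are combined).

```python
import numpy as np

BASE_KERNELS = ["SE", "PER", "LIN"]   # hypothetical vocabulary of base kernels
HIDDEN = 16

rng = np.random.default_rng(0)
# Randomly initialised (untrained) weights standing in for the parameters psi.
W_point = rng.standard_normal((2, HIDDEN)) * 0.5                   # embeds each (x, y) pair
W_kernel = rng.standard_normal((len(BASE_KERNELS), HIDDEN)) * 0.5  # one embedding per base kernel
W_theta = rng.standard_normal((2 * HIDDEN, 2)) * 0.5               # two parameters per base kernel
W_noise = rng.standard_normal((HIDDEN, 1)) * 0.5                   # noise-variance head


def softplus(z):
    """Smooth map from the real line to positive values."""
    return np.logaddexp(0.0, z)


def g_psi(X, y, structure):
    """Map a dataset (1-D inputs X, targets y) and a kernel structure
    (a list of base-kernel names) to positive parameter estimates."""
    points = np.column_stack([X, y])                    # shape (n, 2)
    data_emb = np.tanh(points @ W_point).mean(axis=0)   # mean pooling: invariant to data order
    theta_hat = []
    for name in structure:
        kernel_emb = W_kernel[BASE_KERNELS.index(name)]
        features = np.concatenate([data_emb, kernel_emb])
        theta_hat.append(softplus(features @ W_theta))  # parameters of this base kernel
    noise_hat = softplus(data_emb @ W_noise)[0]
    return np.array(theta_hat), noise_hat


X = np.linspace(-3.0, 3.0, 25)
y = np.sin(X)
theta_se_per, noise = g_psi(X, y, ["SE", "PER"])
theta_lin, _ = g_psi(X, y, ["LIN"])   # same weights, different kernel structure
```

The mean over data points makes the output independent of the order of the observations, and each base kernel receives its own parameter vector, so reordering the structure reorders the outputs accordingly. These are simple instances of the kind of invariance and equivariance mentioned above; a trained network would additionally need to account for the operators that combine the base kernels.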
### Mathematical formula representation
To understand the problem more clearly, some key mathematical formulas are listed here:
1. **Definition of Gaussian process**:
\[
f\sim\text{GP}(m,k)
\]
where \(m(x)\) is the mean function and \(k(x,x')\) is the kernel function.
2. **Marginal likelihood**:
\[
p(y|X,\theta,\sigma^{2})=\mathcal{N}(y;m(X),k_{\theta}(X,X)+\sigma^{2}I)
\]
where \(y\) is the observed value, \(X\) is the input data, \(\theta\) is the kernel parameter, and \(\sigma^{2}\) is the noise variance.
3. **Optimization problem**:
\[
(\theta^{*},\sigma^{2*})=\arg\max_{(\theta,\sigma^{2})\in\Phi}\log p(y|X,\theta,\sigma^{2})
\]
4. **Output of the amortized inference network**:
\[
(\hat{\theta}_{S},\hat{\sigma}^{2}) = g_{\psi}(D,S)
\]
where \(D\) is the data set, \(S\) is the kernel expression, and \(g_{\psi}\) is the amortized inference network.
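
For contrast with the single forward pass in formula 4, formulas 2 and 3 can be sketched as an explicit optimization loop. The example below is an illustration under stated assumptions, not the procedure used in the paper: it assumes a zero mean function, a squared-exponential (RBF) kernel, finite-difference gradients in place of automatic differentiation, and plain gradient ascent with step halving. The helper names `log_marginal_likelihood`, `numerical_grad`, and `fit` are hypothetical.

```python
import numpy as np


def log_marginal_likelihood(log_params, X, y):
    """log N(y; 0, k_theta(X, X) + sigma^2 I) for a zero-mean GP with an RBF kernel.

    log_params holds the logs of (lengthscale, kernel variance, noise variance),
    so the optimizer below works on an unconstrained space.
    """
    lengthscale, variance, noise_variance = np.exp(log_params)
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = variance * np.exp(-0.5 * sq_dists / lengthscale**2) + noise_variance * np.eye(n)
    L = np.linalg.cholesky(K)                             # O(n^3), paid at every evaluation
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (y @ alpha + log_det + n * np.log(2.0 * np.pi))


def numerical_grad(f, x, eps=1e-5):
    """Central finite differences; a stand-in for automatic differentiation."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


def fit(X, y, num_steps=200, init_lr=0.1):
    """Gradient ascent with step halving, so the objective never decreases."""
    objective = lambda p: log_marginal_likelihood(p, X, y)
    params = np.zeros(3)                  # start at lengthscale = variance = noise = 1
    value = objective(params)
    for _ in range(num_steps):
        grad = numerical_grad(objective, params)
        lr = init_lr
        for _ in range(20):               # halve the step until it improves the objective
            candidate = np.clip(params + lr * grad, -6.0, 6.0)
            candidate_value = objective(candidate)
            if candidate_value > value:
                params, value = candidate, candidate_value
                break
            lr *= 0.5
    return np.exp(params), value


rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
initial_lml = log_marginal_likelihood(np.zeros(3), X, y)
(lengthscale, variance, noise_variance), final_lml = fit(X, y)
print(f"log marginal likelihood: {initial_lml:.2f} -> {final_lml:.2f}")
```

Every evaluation of the objective requires a Cholesky factorization of an \(n\times n\) matrix, and the loop performs many such evaluations per dataset. This repeated cost is what the amortization network is meant to avoid.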
In this way, the proposed method replaces per-dataset optimization with a single evaluation of the amortization network. According to the abstract, the experiments show drastically reduced inference time combined with competitive test performance for a large set of kernels and datasets, and the same network can be reused for any kernel structure in the family without retraining.