Enhancing Diversity in Bayesian Deep Learning via Hyperspherical Energy Minimization of CKA

David Smerkous, Qinxun Bai, Fuxin Li
2024-11-01
Abstract: Particle-based Bayesian deep learning often requires a similarity metric to compare two networks. However, naive similarity metrics lack permutation invariance and are inappropriate for comparing networks. Centered Kernel Alignment (CKA) on feature kernels has been proposed to compare deep networks but has not been used as an optimization objective in Bayesian deep learning. In this paper, we explore the use of CKA in Bayesian deep learning to generate diverse ensembles and hypernetworks that output a network posterior. Noting that CKA projects kernels onto a unit hypersphere and that directly optimizing the CKA objective leads to diminishing gradients when two networks are very similar, we propose adopting the approach of hyperspherical energy (HE) on top of CKA kernels to address this drawback and improve training stability. Additionally, by leveraging CKA-based feature kernels, we derive feature repulsive terms applied to synthetically generated outlier examples. Experiments on both diverse ensembles and hypernetworks show that our approach significantly outperforms baselines in terms of uncertainty quantification in both synthetic and realistic outlier detection tasks.
Machine Learning
What problem does this paper attempt to address?
The core problem that this paper attempts to solve is: **how to generate more diverse neural network ensembles in Bayesian deep learning in order to improve uncertainty estimation**. Specifically, the authors enhance network diversity by minimizing feature similarity, so that the ensemble better captures different modes of the model posterior and is better at detecting out-of-distribution (OOD) samples.

### Problem Background

In Bayesian deep learning, it is usually necessary to compare the similarity between two neural networks in order to generate diverse model ensembles or hypernetworks. Traditional similarity measures (such as L1/L2 distance) perform poorly in high-dimensional spaces and lack permutation invariance, which makes them unsuitable for comparing neural networks. To overcome these problems, Kornblith et al. proposed Centered Kernel Alignment (CKA), which compares networks through Gram matrices of their features on a shared dataset rather than through their parameters directly. However, directly optimizing the CKA objective leads to vanishing gradients, especially when the two networks are very similar. The authors therefore combine CKA with Hyperspherical Energy (HE) minimization to train diverse model ensembles more stably.

### Solution

1. **Introducing CKA as a similarity measure**:
   - CKA measures the similarity between two networks by comparing their Gram matrices on the same dataset.
   - The formula is:
     \[ \text{CKA}(K_1, K_2)=\frac{\text{HSIC}(K_1, K_2)}{\sqrt{\text{HSIC}(K_1, K_1)\,\text{HSIC}(K_2, K_2)}} \]
     where \(\text{HSIC}\) is the Hilbert-Schmidt Independence Criterion, which measures the dependence between two random variables.
2. **Combining HE minimization**:
   - To overcome the vanishing gradients that arise when optimizing CKA directly, the authors introduce Hyperspherical Energy (HE) minimization.
   - The goal of HE is to distribute the models evenly on the hypersphere, thereby maximizing the geodesic distance between them.
   - The formula is:
     \[ \text{HE-CKA}(K)=\frac{1}{LM(M - 1)}\sum_{l = 1}^{L}\sum_{m\neq m'}\left(\arccos\left((\bar{K}_m^{l})^{T}\bar{K}_{m'}^{l}\right)\right)^{-s} \]
     where \(\bar{K}_m^{l}\) is the normalized Gram matrix of model \(m\) at layer \(l\) (flattened to a unit vector), \(M\) is the number of models, \(L\) is the number of layers, and \(s\) is the parameter of the Riesz s-kernel.
3. **Applying to deep ensembles and hypernetworks**:
   - The authors apply the method above to deep ensembles and hypernetworks to generate diverse models.
   - For deep ensembles, uncertainty estimation is improved by minimizing the similarity between models.
   - For hypernetworks, the HE-CKA loss term helps avoid mode collapse, so the hypernetwork generates more diverse weights.
4. **Synthesizing OOD feature diversity**:
   - To further improve OOD detection, the authors also propose reducing feature similarity on synthetic OOD samples, which makes the models more sensitive to abnormal inputs.

### Experimental Results

Experiments show that this method significantly outperforms the baseline methods on multiple benchmark datasets, particularly in uncertainty estimation and OOD detection. With the HE-CKA loss term, the models maintain prediction accuracy while achieving significant improvements in uncertainty estimation and OOD detection.

In conclusion, this paper, by introducing...
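
The CKA and HE-CKA formulas from the Solution section can be sketched in a few lines of code. This is a minimal, dependency-free illustration, not the authors' implementation. The function names, the linear kernel, the nested-list layout `kernels[l][m]`, the default `s=2.0`, and the small `eps` added to the angle to avoid division by zero are all assumptions made for this example. It relies on the identity that HSIC between two Gram matrices is, up to a constant factor, the Frobenius inner product of their double-centered versions, so CKA reduces to a cosine between unit vectors.

```python
import math


def linear_gram(features):
    """Gram matrix K[i][j] = <x_i, x_j> for a list of per-sample feature vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in features]
            for xi in features]


def center(K):
    """Double-center K, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def unit_vector(K):
    """Flatten the centered Gram matrix and scale it to unit norm.

    Assumes the centered matrix is non-zero (features are not constant).
    """
    flat = [v for r in center(K) for v in r]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat]


def cka(K1, K2):
    """CKA as the cosine between centered, normalized Gram matrices."""
    return sum(a * b for a, b in zip(unit_vector(K1), unit_vector(K2)))


def he_cka(kernels, s=2.0, eps=1e-6):
    """Hyperspherical energy over CKA kernels.

    kernels[l][m] is the Gram matrix of model m at layer l (assumed layout).
    eps is an illustrative safeguard for identical kernels (angle of zero).
    """
    L, M = len(kernels), len(kernels[0])
    energy = 0.0
    for layer in kernels:
        vecs = [unit_vector(K) for K in layer]
        for m in range(M):
            for mp in range(M):
                if m == mp:
                    continue
                cos = sum(a * b for a, b in zip(vecs[m], vecs[mp]))
                cos = max(-1.0, min(1.0, cos))  # guard acos against rounding
                energy += (math.acos(cos) + eps) ** (-s)
    return energy / (L * M * (M - 1))


# Two hypothetical models evaluated on the same four inputs, one layer each.
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
feats_b = [[0.0, 1.0], [1.0, 3.0], [2.0, 0.0], [0.5, 0.5]]
K_a, K_b = linear_gram(feats_a), linear_gram(feats_b)
print("CKA:", cka(K_a, K_b))
print("HE-CKA:", he_cka([[K_a, K_b]]))
```

Because the Gram matrix depends only on inner products between samples, reordering the feature dimensions (neurons) of a model leaves its kernel unchanged, which is the permutation invariance that parameter-space distances lack. In a training setting the same computation would be written in an autodiff framework so that the energy can serve as a repulsive loss term.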