Abstract: Particle-based Bayesian deep learning often requires a similarity metric to compare two networks. However, naive similarity metrics lack permutation invariance and are inappropriate for comparing networks. Centered Kernel Alignment (CKA) on feature kernels has been proposed to compare deep networks but has not been used as an optimization objective in Bayesian deep learning. In this paper, we explore the use of CKA in Bayesian deep learning to generate diverse ensembles and hypernetworks that output a network posterior. Noting that CKA projects kernels onto a unit hypersphere and that directly optimizing the CKA objective leads to diminishing gradients when two networks are very similar, we propose adopting the approach of hyperspherical energy (HE) on top of CKA kernels to address this drawback and improve training stability. Additionally, by leveraging CKA-based feature kernels, we derive feature repulsive terms applied to synthetically generated outlier examples. Experiments on both diverse ensembles and hypernetworks show that our approach significantly outperforms baselines in terms of uncertainty quantification in both synthetic and realistic outlier detection tasks.
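
The diminishing-gradient remark can be illustrated with a short calculation. This is only a sketch: it assumes CKA is read as the cosine of the angle between two centered, normalized kernels on the unit hypersphere, and the symbol \(\theta\) for that angle is introduced here for illustration rather than taken from the abstract.

\[ \text{CKA} = \cos\theta, \qquad \frac{\partial \cos\theta}{\partial\theta} = -\sin\theta \;\to\; 0 \quad \text{as } \theta \to 0 \]

\[ \frac{\partial\, \theta^{-s}}{\partial\theta} = -s\,\theta^{-(s+1)} \;\to\; -\infty \quad \text{as } \theta \to 0^{+}, \quad s > 0 \]

Under this reading, two nearly identical networks (\(\theta \approx 0\)) receive almost no repulsive signal from the raw CKA objective, whereas an energy of the form \(\theta^{-s}\) pushes them apart most strongly exactly in that regime.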

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is: **How to generate more diverse neural network ensembles in Bayesian deep learning to improve uncertainty estimation**. Specifically, the authors focus on enhancing network diversity by minimizing feature similarity, thereby better capturing different modes of the model posterior distribution and improving the ability to detect out-of-distribution (OOD) samples.

### Problem Background

In Bayesian deep learning, it is usually necessary to compare the similarity between two neural networks in order to generate diverse model ensembles or hypernetworks. Traditional similarity measures (such as L1/L2 distance) perform poorly in high-dimensional spaces and lack permutation invariance, so they are not suitable for comparing neural networks. To overcome these problems, Kornblith et al. proposed Centered Kernel Alignment (CKA), which compares networks through kernels computed on their features rather than through raw parameter values. However, directly optimizing the CKA objective leads to vanishing gradients, especially when the two networks are very similar. Therefore, the authors propose combining CKA with Hyperspherical Energy (HE) minimization to train diverse model ensembles more stably.

### Solution

1. **Introducing CKA as a similarity measure**:
   - CKA measures the similarity between two networks by comparing their Gram matrices on the same dataset.
   - The formula is:
     \[ \text{CKA}(K_1, K_2)=\frac{\text{HSIC}(K_1, K_2)}{\sqrt{\text{HSIC}(K_1, K_1)\,\text{HSIC}(K_2, K_2)}} \]
     where \(\text{HSIC}\) is the Hilbert-Schmidt Independence Criterion, which measures the dependence between two random variables.
2. **Combining HE minimization**:
   - To overcome the vanishing gradients that arise when optimizing CKA directly, the authors introduce Hyperspherical Energy (HE) minimization.
   - The goal of HE is to spread the models evenly over the hypersphere, thereby maximizing the geodesic distance between them.
   - The formula is:
     \[ \text{HE-CKA}(K)=\frac{1}{LM(M - 1)}\sum_{l = 1}^{L}\sum_{m\neq m'}\left(\arccos\left((\bar{K}_m^{l})^{\top}\bar{K}_{m'}^{l}\right)\right)^{-s} \]
     where \(\bar{K}_m^{l}\) is the normalized Gram matrix of model \(m\) at layer \(l\), \(M\) is the number of models, \(L\) is the number of layers, and \(s\) is the parameter of the Riesz \(s\)-kernel.
3. **Applying to deep ensembles and hypernetworks**:
   - The authors apply the above method to deep ensembles and hypernetworks to generate diverse models.
   - For deep ensembles, uncertainty estimation is improved by minimizing the similarity between models.
   - For hypernetworks, the HE-CKA loss term helps avoid mode collapse, so the generated weights are more diverse.
4. **Synthesizing OOD feature diversity**:
   - To further improve OOD detection, the authors also propose reducing feature similarity on synthetic OOD samples, making the models more sensitive to anomalous inputs.

### Experimental Results

Experiments show that this method significantly outperforms the baseline methods on multiple benchmark datasets, particularly in uncertainty estimation and OOD detection tasks. With the HE-CKA loss term, the model maintains prediction accuracy while achieving significant improvements in uncertainty estimation and OOD detection.

In conclusion, this paper, by introducing...
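
The two formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a linear kernel, a single layer (\(L = 1\)), and hypothetical function names (`centered_gram`, `unit_vector`, `cka`, `he_cka`). The small `eps` added to the angle is an assumption made here for numerical stability. The sketch relies on the fact that CKA equals the cosine between the vectorized, centered Gram matrices, since \(\text{HSIC}(K_1, K_2)\) is proportional to \(\operatorname{tr}(K_1 H K_2 H)\) and the normalization constants cancel.

```python
import numpy as np


def centered_gram(features):
    """Linear Gram matrix K = X X^T, double-centered as H K H."""
    x = np.asarray(features, dtype=float)
    k = x @ x.T
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return h @ k @ h


def unit_vector(gram):
    """Flatten a Gram matrix and project it onto the unit hypersphere."""
    v = gram.ravel()
    return v / np.linalg.norm(v)


def cka(feats_a, feats_b):
    """CKA as the cosine between centered, normalized Gram matrices."""
    a = unit_vector(centered_gram(feats_a))
    b = unit_vector(centered_gram(feats_b))
    return float(a @ b)


def he_cka(feature_sets, s=2.0, eps=1e-6):
    """Mean Riesz s-energy of pairwise geodesic angles (single layer).

    `eps` is an assumption of this sketch: it keeps the energy finite
    when two models produce identical kernels (angle of zero).
    """
    vecs = [unit_vector(centered_gram(f)) for f in feature_sets]
    m = len(vecs)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i != j:
                cos = np.clip(vecs[i] @ vecs[j], -1.0, 1.0)
                total += (np.arccos(cos) + eps) ** (-s)
    return total / (m * (m - 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Each array plays the role of one model's features: 20 inputs, 8 units.
    models = [rng.normal(size=(20, 8)) for _ in range(3)]
    print("CKA(model 0, model 1):", cka(models[0], models[1]))
    print("HE-CKA over 3 models:", he_cka(models))
```

Permuting the feature columns of a model leaves `cka` unchanged, which is the permutation invariance that L1/L2 distances on parameters lack. In a training loop the same quantities would be computed with an automatic-differentiation framework so that the energy can be minimized by gradient descent.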

Enhancing Diversity in Bayesian Deep Learning via Hyperspherical Energy Minimization of CKA
