DICE: Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation

Alexandre Rame, Matthieu Cord
DOI: https://doi.org/10.48550/arXiv.2101.05544
2021-01-14
Abstract: Deep ensembles perform better than a single network thanks to the diversity among their members. Recent approaches regularize predictions to increase diversity; however, they also drastically decrease individual members' performances. In this paper, we argue that learning strategies for deep ensembles need to tackle the trade-off between ensemble diversity and individual accuracies. Motivated by arguments from information theory and leveraging recent advances in neural estimation of conditional mutual information, we introduce a novel training criterion called DICE: it increases diversity by reducing spurious correlations among features. The main idea is that features extracted from pairs of members should only share information useful for target class prediction without being conditionally redundant. Therefore, besides the classification loss with information bottleneck, we adversarially prevent features from being conditionally predictable from each other. We manage to reduce simultaneous errors while protecting class information. We obtain state-of-the-art accuracy results on CIFAR-10/100: for example, an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. We further analyze the consequences on calibration, uncertainty estimation, out-of-distribution detection and online co-distillation.
Machine Learning, Computer Vision and Pattern Recognition, Information Theory
What problem does this paper attempt to address?
This paper addresses how to increase diversity among the members of a deep ensemble without degrading the accuracy of the individual members. The authors point out that current methods increase diversity by regularizing predictions, but this often causes a significant drop in individual members' performance. The goal is therefore a training strategy that better balances ensemble diversity and individual accuracy.

### Problem Background

Deep ensembles usually perform better than a single neural network because diversity among members reduces prediction errors. However, existing methods that increase diversity, for example by regularizing predictions, tend to sacrifice the performance of each individual model. In addition, traditional deep-ensemble training relies only on randomness in initialization and in the learning process, which cannot guarantee sufficient diversity.

### Core Problem of the Paper

The authors argue that the learning strategy must handle the trade-off between ensemble diversity and individual accuracy. To this end, they propose a new training criterion, DICE (Diversity in Deep Ensembles via Conditional Redundancy Adversarial Estimation), which aims to increase diversity by reducing conditional redundancy between features while protecting information related to the target class.

### Solution

The main idea of DICE is that the features extracted by different members should share only the information that is useful for predicting the target class, and should not be conditionally redundant beyond that. Specifically, DICE combines two components:

1. **Information Bottleneck (IB)**: Use the information bottleneck principle to compress features and remove factors unrelated to the task.
2. **Conditional Redundancy Adversarial Estimation**: Through adversarial training, prevent one member's features from being predictable from another member's features once the target class is given, thereby reducing redundancy.

### Formula Representation

The objective function of DICE, which is minimized during training, can be expressed as:

\[ \text{DICE}_{\beta_{ceb}, \delta_{cr}}(Z_1, Z_2)=\frac{1}{\beta_{ceb}}[I(X; Z_1 | Y)+I(X; Z_2 | Y)] - [I(Y; Z_1)+I(Y; Z_2)]+\delta_{cr}I(Z_1; Z_2 | Y) \]

where:

- \(I(X; Z_i | Y)\) is the conditional compression term, which reduces information unrelated to the task.
- \(I(Y; Z_i)\) is the relevance term, which ensures that the features retain information about the target class.
- \(I(Z_1; Z_2 | Y)\) is the conditional redundancy term, which penalizes information shared by the two feature sets beyond what the class explains.

### Experimental Results

The experiments show that DICE improves classification accuracy on the CIFAR-10 and CIFAR-100 datasets: at equal ensemble size, an ensemble trained with DICE outperforms independently trained members, and according to the abstract an ensemble of 5 networks trained with DICE matches an ensemble of 7 networks trained independently. DICE also improves uncertainty estimation and calibration.

### Summary

This paper proposes DICE, a training framework that increases the diversity of deep ensembles by reducing conditional redundancy between features while maintaining high accuracy of the individual models. The method achieves strong results on CIFAR-10 and CIFAR-100 and suggests a direction for future work on ensemble training.
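
As a rough illustration of the conditional redundancy term \(I(Z_1; Z_2 | Y)\) and of how the terms of the objective above combine, the following Python sketch uses a plug-in (counting) estimator on toy discrete features. This is an assumption made for readability: the paper estimates these quantities with neural estimators and adversarial training, not by counting, and the function names `conditional_mutual_information` and `dice_objective` as well as the toy data are hypothetical and do not come from the authors' code.

```python
import math
from collections import Counter


def conditional_mutual_information(samples):
    """Plug-in estimate of I(Z1; Z2 | Y) in nats from (z1, z2, y) samples.

    Toy stand-in for the neural estimator used in the paper (assumption).
    """
    n = len(samples)
    joint = Counter(samples)                          # counts of (z1, z2, y)
    z1_y = Counter((z1, y) for z1, _, y in samples)   # counts of (z1, y)
    z2_y = Counter((z2, y) for _, z2, y in samples)   # counts of (z2, y)
    y_only = Counter(y for _, _, y in samples)        # counts of y
    total = 0.0
    for (z1, z2, y), count in joint.items():
        # p(y) p(z1, z2, y) / (p(z1, y) p(z2, y)); the 1/n factors cancel.
        ratio = (y_only[y] * count) / (z1_y[(z1, y)] * z2_y[(z2, y)])
        total += (count / n) * math.log(ratio)
    return total


def dice_objective(i_x_z1_y, i_x_z2_y, i_y_z1, i_y_z2, i_z1_z2_y,
                   beta_ceb, delta_cr):
    """Combine already-estimated information terms as in the formula above."""
    compression = (i_x_z1_y + i_x_z2_y) / beta_ceb
    relevance = i_y_z1 + i_y_z2
    redundancy = delta_cr * i_z1_z2_y
    return compression - relevance + redundancy


# Toy data: each feature is a pair (class bit, nuisance bit).
bits = (0, 1)
# Both members encode the SAME nuisance bit: redundant beyond the class.
redundant = [((y, a), (y, a), y) for y in bits for a in bits for _ in bits]
# Members encode independent nuisance bits: they share only the class.
diverse = [((y, a), (y, b), y) for y in bits for a in bits for b in bits]

print(f"redundant pair: {conditional_mutual_information(redundant):.4f}")  # 0.6931 (log 2)
print(f"diverse pair:   {conditional_mutual_information(diverse):.4f}")    # 0.0000
```

In both toy pairs each feature fully determines the class, so the relevance terms are identical; only the redundancy term separates them, which is the quantity DICE penalizes through \(\delta_{cr}\).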