A New Family of Generalization Bounds Using Samplewise Evaluated CMI

Fredrik Hellström, Giuseppe Durisi
DOI: https://doi.org/10.48550/arXiv.2210.06422
2023-03-27
Abstract: We present a new family of information-theoretic generalization bounds, in which the training loss and the population loss are compared through a jointly convex function. This function is upper-bounded in terms of the disintegrated, samplewise, evaluated conditional mutual information (CMI), an information measure that depends on the losses incurred by the selected hypothesis, rather than on the hypothesis itself, as is common in probably approximately correct (PAC)-Bayesian results. We demonstrate the generality of this framework by recovering and extending previously known information-theoretic bounds. Furthermore, using the evaluated CMI, we derive a samplewise, average version of Seeger's PAC-Bayesian bound, where the convex function is the binary KL divergence. In some scenarios, this novel bound results in a tighter characterization of the population loss of deep neural networks than previous bounds. Finally, we derive high-probability versions of some of these average bounds. We demonstrate the unifying nature of the evaluated CMI bounds by using them to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension.
Machine Learning, Information Theory
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper introduces a new family of information-theoretic generalization bounds expressed in terms of the disintegrated, samplewise, evaluated conditional mutual information (e-CMI). Specifically, the authors address the following problems:

1. **Improve the tightness of existing generalization bounds**: Existing generalization bounds perform poorly on deep neural networks and, in particular, become looser as training proceeds. The authors aim to obtain tighter bounds by using the e-CMI, which depends on the losses incurred by the selected hypothesis rather than on the hypothesis itself.
2. **Extend and unify existing theoretical results**: The authors show that the proposed framework recovers and extends previously known information-theoretic generalization bounds, and that it accommodates different convex functions for comparing training and population loss, such as the binary KL divergence.
3. **Handle multiclass classification**: The authors use the e-CMI framework to derive average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension.
4. **Verify by numerical experiments**: Experiments on the MNIST and CIFAR10 datasets evaluate the new bounds in practical deep-learning settings, in particular comparing the binary KL bound with the existing square-root and linear bounds.

### Main contributions

- **A new family of generalization bounds**: Several new bounds based on the e-CMI are proposed, including a square-root bound, a linear bound, and a binary KL bound.
- **Application of the evaluated CMI**: The paper shows how the e-CMI can be used to recover and extend existing information-theoretic generalization bounds.
- **High-probability bounds**: High-probability versions of some of the average bounds are derived and applied to multiclass classification.
- **Numerical experiments**: Experiments show that the newly proposed bounds are tighter than existing bounds in some scenarios.

### Mathematical formulas

- **Conditional mutual information (CMI)**, the disintegrated quantity below averaged over \(Z\):
  \[ I(X; Y|Z) = \mathbb{E}_Z\left[ D(P_{XY|Z}\|P_{X|Z}P_{Y|Z}) \right] \]
- **Disintegrated mutual information** (conditioned on a fixed realization \(Z = z\)), on which the disintegrated e-CMI is built:
  \[ I_z(X; Y) = D(P_{XY|Z = z}\|P_{X|Z = z}P_{Y|Z = z}) \]
- **Square-root bound**:
  \[ \left|\mathbb{E}_{\tilde{Z}, S, R}[L_D(A, \tilde{Z}_S, R)]-\hat{L}\right|\leq\frac{1}{n}\sum_{i = 1}^n\sqrt{2I(\ell(A(\tilde{Z}_S, R), \tilde{Z}_i); S_i|\tilde{Z})} \]
- **Binary KL bound**:
  \[ d\left(\hat{L}\middle\|\frac{\hat{L}+L_D}{2}\right)\leq\frac{1}{n}\sum_{i = 1}^nI(\ell(A(\tilde{Z}_S, R), \tilde{Z}_i); S_i|\tilde{Z}) \]

Here \(\tilde{Z}\) is the supersample, \(S\) is the membership vector that selects the training set \(\tilde{Z}_S\), \(R\) is the randomness of the learning algorithm \(A\), \(\ell\) is the loss function, \(L_D\) is the population loss, \(\hat{L}\) is the training loss, and \(d(\cdot\|\cdot)\) is the binary KL divergence.

Together, these bounds provide a tool for characterizing the population loss of deep neural networks that is, in some scenarios, tighter than previous bounds.
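To make the square-root and binary KL bounds above concrete, the following Python sketch turns them into numeric upper bounds on the population loss. It is an illustration under stated assumptions, not the authors' code: it assumes a loss bounded in [0, 1], treats the per-sample CMI terms as already estimated (in nats), and the function names `binary_kl`, `kl_upper_inverse`, `sqrt_bound`, and `binary_kl_bound` are hypothetical. The binary KL bound is inverted by bisection: find the largest \(q\) with \(d(\hat{L}\|q)\leq c\), then use \((\hat{L}+L_D)/2\leq q\) to get \(L_D\leq 2q-\hat{L}\).

```python
import math


def binary_kl(p: float, q: float) -> float:
    """Binary KL divergence d(p || q) in nats, with 0 * log(0) taken as 0."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def kl_upper_inverse(p: float, c: float, tol: float = 1e-10) -> float:
    """Largest q in [p, 1] with d(p || q) <= c, found by bisection.

    Relies on d(p || q) being non-decreasing in q for q >= p.
    """
    lo, hi = p, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(p, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo


def sqrt_bound(train_loss: float, cmi_terms: list[float]) -> float:
    """Population-loss bound from the square-root bound (assumed loss in [0, 1])."""
    gap = sum(math.sqrt(2.0 * v) for v in cmi_terms) / len(cmi_terms)
    return min(1.0, train_loss + gap)


def binary_kl_bound(train_loss: float, cmi_terms: list[float]) -> float:
    """Population-loss bound obtained by inverting the binary KL bound."""
    c = sum(cmi_terms) / len(cmi_terms)       # right-hand side of the bound
    q_star = kl_upper_inverse(train_loss, c)  # (L_hat + L_D) / 2 <= q_star
    return min(1.0, 2.0 * q_star - train_loss)


# Illustrative inputs only: these are made-up values, not results from the paper.
example_train_loss = 0.05
example_cmi_terms = [0.02, 0.02, 0.02, 0.02]
print("square-root bound:", sqrt_bound(example_train_loss, example_cmi_terms))
print("binary KL bound:  ", binary_kl_bound(example_train_loss, example_cmi_terms))
```

For these made-up inputs the binary KL bound comes out smaller than the square-root bound, which matches the kind of behaviour the summary describes. Which bound is tighter in general depends on the training loss and on the size of the CMI terms.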