Abstract: We investigate the in-distribution generalization of machine learning algorithms. We depart from traditional complexity-based approaches by analyzing information-theoretic bounds that quantify the dependence between a learning algorithm and the training data. We consider two categories of generalization guarantees: 1) Guarantees in expectation: These bounds measure performance in the average case. Here, the dependence between the algorithm and the data is often captured by information measures. While these measures offer an intuitive interpretation, they overlook the geometry of the algorithm's hypothesis class. We introduce bounds that use the Wasserstein distance to incorporate geometry, together with a structured, systematic method to derive bounds capturing the dependence between the algorithm and an individual datum, and between the algorithm and subsets of the training data. 2) PAC-Bayesian guarantees: These bounds measure the performance level with high probability. Here, the dependence between the algorithm and the data is often measured by the relative entropy. We establish connections between the Seeger--Langford and Catoni's bounds, revealing that the former is optimized by the Gibbs posterior. We introduce novel, tighter bounds for various types of loss functions. To achieve this, we introduce a new technique to optimize parameters in probabilistic statements. To study the limitations of these approaches, we present a counter-example where most of the information-theoretic bounds fail while traditional approaches do not. Finally, we explore the relationship between privacy and generalization. We show that algorithms with a bounded maximal leakage generalize. For discrete data, we derive new bounds for differentially private algorithms that guarantee generalization even with a constant privacy parameter, which is in contrast to previous bounds in the literature.
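
For orientation, the best-known bound of the in-expectation kind is the mutual information bound of Xu and Raginsky. It is given here as standard background, not as a result of this work. The notation is assumed for this sketch: $W$ is the hypothesis returned by the algorithm, $S$ is a training set of $n$ i.i.d. samples, $L_{\mu}(W)$ and $L_{S}(W)$ are the population and empirical risks, and the loss is taken to be $\sigma$-sub-Gaussian under the data distribution for every fixed hypothesis.

```latex
\[
\left| \mathbb{E}\big[ L_{\mu}(W) - L_{S}(W) \big] \right|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(W;S)}
\]
```

The mutual information $I(W;S)$ depends only on the joint distribution of the hypothesis and the data, not on any metric on the hypothesis class, which is the sense in which such measures overlook geometry.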

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper studies the in-distribution generalization of machine learning algorithms, with a focus on establishing rigorous upper bounds on the generalization error. Specifically, it addresses the following issues:

1. **Limitations of Traditional Complexity Methods**
   - Traditional generalization theory usually relies on complexity measures such as Rademacher complexity and VC dimension. While effective in certain cases, these methods often overlook the dependence between the algorithm and the training data.
2. **Application of Information-Theoretic Methods**
   - The paper introduces and analyzes information-theoretic methods that quantify the dependence between a learning algorithm and its training data. These methods use measures such as mutual information and f-divergences, which offer an intuitive interpretation but overlook the geometry of the hypothesis class.
3. **Generalization Guarantees in Expectation**
   - The paper studies guarantees in expectation, that is, bounds on average-case performance. It incorporates geometry through the Wasserstein distance to address this limitation of mutual-information bounds, and applies the results to derive generalization error bounds for the Stochastic Gradient Langevin Dynamics (SGLD) algorithm.
4. **PAC-Bayesian Generalization Guarantees**
   - The paper studies high-probability guarantees, that is, bounds on the performance level that hold with high probability. With the dependence measured by the relative entropy, it connects the Seeger–Langford and Catoni bounds and shows that the former is optimized by the Gibbs posterior. It also proposes new, tighter bounds for several types of loss functions, characterized by their range, cumulant generating function, moments, or variance.
5. **Relationship Between Privacy and Generalization**
   - The paper explores the relationship between privacy and generalization. It proves that algorithms with bounded maximal leakage generalize. For differentially private algorithms on discrete data, it derives new bounds that guarantee generalization even when the privacy parameter is held constant, in contrast to previous bounds in the literature.

### Summary

By adopting information-theoretic methods, the paper aims to overcome the limitations of traditional complexity-based methods and to provide more rigorous and comprehensive upper bounds on the generalization error. It proposes new techniques for both guarantees in expectation and PAC-Bayesian guarantees. It also examines how privacy relates to generalization, offering new perspectives for understanding and improving the generalization ability of machine learning algorithms.
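
To make the PAC-Bayesian guarantees above more concrete, the following is a minimal sketch of how a Seeger–Langford style bound can be evaluated numerically. It assumes a loss in [0, 1] and the common form kl(empirical risk || population risk) <= (KL(posterior || prior) + ln(2 sqrt(n) / delta)) / n; the exact constants used in the paper may differ. The function names and the bisection approach are illustrative choices, not taken from the paper.

```python
import math


def binary_kl(q, p):
    """Relative entropy kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp p to avoid log(0)
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total


def kl_inverse_upper(q, budget, tol=1e-10):
    """Largest p in [q, 1] with kl(q || p) <= budget, found by bisection.

    kl(q || p) is increasing in p for p >= q, so bisection applies.
    """
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def seeger_langford_bound(emp_risk, kl_posterior_prior, n, delta):
    """Upper bound on the population risk under the assumed bound form.

    emp_risk: empirical risk of the posterior, in [0, 1]
    kl_posterior_prior: relative entropy KL(posterior || prior), in nats
    n: number of training samples
    delta: failure probability of the high-probability statement
    """
    budget = (kl_posterior_prior + math.log(2 * math.sqrt(n) / delta)) / n
    return kl_inverse_upper(emp_risk, budget)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    print(seeger_langford_bound(emp_risk=0.1, kl_posterior_prior=5.0, n=1000, delta=0.05))
```

The bound is implicit in the population risk, so it has to be inverted numerically; the relative entropy term is the only place where the dependence between the algorithm and the data enters.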

An Information-Theoretic Approach to Generalization Theory
