Abstract: Training on mixtures of data distributions is now common in many modern machine learning pipelines, useful for performing well on several downstream tasks. Group distributionally robust optimization (group DRO) is one popular way to learn mixture weights for training a specific model class, but group DRO methods suffer for non-linear models due to non-convex loss functions and when the models are non-parametric. We address these challenges by proposing to solve a more general DRO problem, giving a method we call MixMax. MixMax selects mixture weights by maximizing a particular concave objective with entropic mirror ascent, and, crucially, we prove that optimally fitting this mixture distribution over the set of bounded predictors returns a group DRO optimal model. Experimentally, we tested MixMax on a sequence modeling task with transformers and on a variety of non-parametric learning problems. In all instances MixMax matched or outperformed the standard data mixing and group DRO baselines, and in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.
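
For orientation, the group DRO problem and the concave maximization described in the abstract can be written out as follows. This is a sketch in assumed notation, not taken from the paper: $p_1, \dots, p_k$ are the group distributions, $\Delta^k$ is the probability simplex, $p_\lambda = \sum_i \lambda_i p_i$ is the mixture, $\ell$ is the loss, and $\mathcal{F}$ is the set of bounded predictors.

$$
\min_{f \in \mathcal{F}} \; \max_{i \in \{1,\dots,k\}} \; \mathbb{E}_{(x,y) \sim p_i}\big[\ell(f(x), y)\big]
\;=\;
\min_{f \in \mathcal{F}} \; \max_{\lambda \in \Delta^k} \; \mathbb{E}_{(x,y) \sim p_\lambda}\big[\ell(f(x), y)\big]
$$

$$
\lambda^\star \in \arg\max_{\lambda \in \Delta^k} \; \min_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim p_\lambda}\big[\ell(f(x), y)\big]
$$

The first equality holds because the expected loss is linear in $\lambda$, so its maximum over the simplex is attained at a vertex. The inner minimum in the second line is a pointwise minimum of functions that are linear in $\lambda$, which is why the objective is concave in $\lambda$. The abstract's claim is that fitting the mixture $p_{\lambda^\star}$ optimally over bounded predictors yields a group DRO optimal model.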

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses how to choose data mixing weights when training machine learning models on mixtures of data distributions. Specifically, the authors focus on weighting multiple data sources or groups (e.g., data from different geographical locations or demographic groups) so that the trained model performs well even on the worst-case distribution. Optimizing for worst-case performance in this way is known as distributionally robust optimization (DRO). Existing methods, such as group DRO, have limitations for non-linear models, whose loss functions are often non-convex, and for non-parametric models. To this end, the authors propose a new method, MixMax, which selects data mixing weights by maximizing a specific concave objective function. MixMax applies not only to common parametric models but also to non-parametric models, such as gradient-boosted trees (XGBoost).

### Main Contributions

1. **Theoretical Contributions**:
   - Established a min-max theorem for the DRO problem over bounded function spaces.
   - Proved that for cross-entropy and squared loss, the optimal data mixing weights can be found by maximizing a concave objective function.
2. **Methodological Contributions**:
   - Proposed the MixMax method, which optimizes data mixing weights with stochastic entropic mirror ascent.
   - MixMax can be applied to non-parametric models, which existing group DRO methods cannot handle.
3. **Experimental Validation**:
   - Ran experiments with transformers on sequence modeling tasks, showing that MixMax outperforms the standard data mixing and group DRO baselines.
   - Ran experiments with XGBoost on multiple tabular datasets, significantly improving test accuracy, especially in cases with large label shifts.

### Experimental Results

- **Binary Classification Tasks**: MixMax chose weights closer to maximum entropy, thereby reducing test loss under the worst distribution.
- **Regression Tasks**: MixMax balanced the extreme Bayes optimal functions, similarly reducing test loss under the worst distribution.
- **Sequence Modeling Tasks**: On generated Markov chain data, MixMax found better mixing weights than other methods, and its ensemble model outperformed models trained with group DRO.
- **XGBoost Experiments**: On the ACSIncome and CelebA annotations datasets, MixMax significantly improved worst-group accuracy in cases with large label shifts.

### Conclusion

With MixMax, the paper addresses how to choose data mixing weights when training models on mixtures of data distributions, especially for non-parametric and non-linear models. Experimental results show that MixMax matches or outperforms the existing baselines across the tasks and datasets tested.
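
The entropic mirror ascent step mentioned in the summary can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a pure label-shift setting with no covariates, where each group is just a known label distribution, and it assumes cross-entropy loss, so that the best predictor for a mixture is the mixture's own label distribution and the concave objective is the entropy of that mixture. The function names, step size, and example distributions are all hypothetical.

```python
import numpy as np


def per_group_cross_entropy(weights, group_label_probs):
    """Cross-entropy of each group under the mixture's own label distribution.

    Up to an additive constant, this is the gradient of the mixture entropy
    H(sum_i weights[i] * group_label_probs[i]) with respect to the weights.
    """
    mixture = weights @ group_label_probs            # shape: (num_labels,)
    log_mixture = np.log(np.clip(mixture, 1e-12, None))
    return -(group_label_probs @ log_mixture)        # shape: (num_groups,)


def entropic_mirror_ascent(group_label_probs, step_size=0.5, num_steps=500):
    """Maximize the mixture entropy over the simplex by exponentiated gradient."""
    num_groups = group_label_probs.shape[0]
    weights = np.full(num_groups, 1.0 / num_groups)  # start from uniform weights
    for _ in range(num_steps):
        grad = per_group_cross_entropy(weights, group_label_probs)
        # Multiplicative update in log space, then renormalize onto the simplex.
        logits = np.log(weights) + step_size * grad
        logits -= logits.max()                       # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum()
    return weights


# Hypothetical example: two groups with binary labels and different label balance.
group_label_probs = np.array([[0.9, 0.1],
                              [0.3, 0.7]])
weights = entropic_mirror_ascent(group_label_probs)
group_losses = per_group_cross_entropy(weights, group_label_probs)
print(weights, group_losses)
```

In this toy case the mixture's probability of the first label is 0.3 + 0.6 λ₁, so the entropy is maximized at λ = (1/3, 2/3), where the mixture is uniform over labels and both groups incur the same loss. The update upweights whichever group currently has the higher loss, which is the intuition behind worst-case robustness. With covariates or real data, the gradient would have to be estimated from a fitted model rather than computed in closed form.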

Finding Optimally Robust Data Mixtures via Concave Maximization
