Abstract: Predictive multiplicity occurs when classification models with statistically indistinguishable performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new metric, called Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure and report predictive multiplicity prior to model deployment.

What problem does this paper attempt to address?

This paper addresses predictive multiplicity: in classification tasks, multiple models with statistically almost identical performance may give conflicting predictions for individual samples. When such models are used for decision-making (e.g., lending, education, criminal justice), ignoring predictive multiplicity may lead to unjustified and arbitrary decisions. The authors introduce a new metric, Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Unlike previous multiplicity metrics, which focus on thresholded (i.e., 0-1) predicted classes, Rashomon Capacity captures finer score variations for individual samples and applies to probabilistic classifiers (such as neural networks with a softmax output layer). The core content is summarized below.

### 1. **Background and Motivation**

- **Rashomon Effect**: Described by Breiman (2001), this is the phenomenon in which multiple different prediction models achieve similar training or test loss.
- **Predictive Multiplicity**: Occurs when competing models in the Rashomon set assign conflicting predictions to individual samples. This may lead to unjustified decisions in critical applications.

### 2. **Limitations of Existing Metrics**

- Existing metrics (such as ambiguity and discrepancy) are based mainly on thresholded predicted classes and may mask the actual diversity of predictions, especially for probabilistic classifiers.

### 3. **Definition and Characteristics of Rashomon Capacity**

- **Definition**: Rashomon Capacity uses KL divergence to quantify how much the scores assigned to a given input sample vary across the models in the Rashomon set.
- **Formula Representation**:

  \[ m_C(x_i)=2^{C(M_\epsilon(x_i))}, \quad C(M_\epsilon(x_i)) = \sup_{P_M} \inf_{q \in \Delta^c} \mathbb{E}_{h \sim P_M} \left[ D_{\text{KL}}(h(x_i) \parallel q) \right] \]

  where \( M_\epsilon(x_i) \) is the set of output scores of all models in the Rashomon set for sample \( x_i \), \( P_M \) is a probability distribution over the models in the Rashomon set, \( \Delta^c \) is the probability simplex over the \( c \) classes, and \( D_{\text{KL}} \) is the KL divergence.

### 4. **Computational Challenges and Solutions**

- **Computational Challenges**: Computing the Rashomon set exactly is usually infeasible, especially for complex hypothesis spaces (such as neural networks).
- **Solutions**: Use model weight perturbation to find a subset of the Rashomon set that captures most of the score variation. Specific methods include:
  - Use Adversarial Weight Perturbation (AWP) to explore the Rashomon set.
  - Use the Blahut-Arimoto algorithm to compute Rashomon Capacity.

### 5. **Practical Applications and Significance**

- Rashomon Capacity can help data scientists measure and report predictive multiplicity before model deployment, supporting a more transparent and fair decision-making process.
- Identifying and disclosing predictive multiplicity can reduce unjustified and arbitrary decisions in critical areas (such as medicine, education, lending).

In conclusion, this paper addresses the shortcomings of existing metrics in capturing the predictive multiplicity of probabilistic classifiers by introducing a new metric, Rashomon Capacity, with the aim of improving the reliability and fairness of models in practical applications.
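
The Blahut-Arimoto step mentioned in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rashomon_capacity` and its parameters are hypothetical, and it assumes the score vectors \( h(x_i) \) of a finite sample of models from the Rashomon set have already been collected (for example, by weight perturbation) into one array per input sample. The logarithm is taken in base 2 so that the capacity is in bits and \( m_C(x_i) = 2^{C} \) lies between 1 and the number of classes.

```python
import numpy as np


def rashomon_capacity(scores, n_iter=500, tol=1e-10):
    """Hypothetical helper: Blahut-Arimoto estimate of m_C = 2**C for one sample.

    scores: array of shape (n_models, n_classes); row k holds the score
    vector h_k(x_i) of the k-th sampled model (rows are renormalized to sum to 1).
    Returns (m_C, weights), where weights approximates the maximizing P_M.
    """
    P = np.asarray(scores, dtype=float)
    P = P / P.sum(axis=1, keepdims=True)
    n_models = P.shape[0]
    w = np.full(n_models, 1.0 / n_models)  # P_M, initialized to uniform
    eps = 1e-12  # guards log2(0); terms with P == 0 still contribute 0

    def kl_bits(weights):
        # For fixed weights, the inner infimum over q is attained at the mixture.
        q = weights @ P
        return np.sum(P * (np.log2(P + eps) - np.log2(q + eps)), axis=1)

    for _ in range(n_iter):
        # Blahut-Arimoto update: reweight each model by 2**KL(h_k(x_i) || q).
        new_w = w * np.exp2(kl_bits(w))
        new_w /= new_w.sum()
        converged = np.max(np.abs(new_w - w)) < tol
        w = new_w
        if converged:
            break

    capacity = float(w @ kl_bits(w))  # C in bits
    return 2.0 ** capacity, w


if __name__ == "__main__":
    # Two models that fully disagree on a binary task: m_C is close to 2.
    print(rashomon_capacity([[1.0, 0.0], [0.0, 1.0]])[0])
    # Two models that agree exactly: m_C is close to 1.
    print(rashomon_capacity([[0.7, 0.3], [0.7, 0.3]])[0])
```

Under these assumptions, a value near 1 indicates that the sampled models essentially agree on the sample, while a value near the number of classes indicates that they disagree as much as possible. The estimate covers only the models supplied, so it reflects the full Rashomon set only to the extent that the sample does.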

Rashomon Capacity: A Metric for Predictive Multiplicity in Classification
