Rashomon Capacity: A Metric for Predictive Multiplicity in Classification

Hsiang Hsu, Flavio du Pin Calmon
DOI: https://doi.org/10.48550/arXiv.2206.01295
2022-10-20
Abstract: Predictive multiplicity occurs when classification models with statistically indistinguishable performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new metric, called Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure and report predictive multiplicity prior to model deployment.
Machine Learning, Information Theory
What problem does this paper attempt to address?
This paper addresses predictive multiplicity in classification: when multiple models have statistically indistinguishable performance, they may still assign conflicting predictions to individual samples. When such models are used for decision-making (e.g., lending, education, criminal justice), ignoring predictive multiplicity may lead to unjustified and arbitrary decisions for specific individuals. The authors introduce a new metric, Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Unlike previous multiplicity metrics, which focus on thresholded (i.e., 0-1) predicted classes, Rashomon Capacity applies to probabilistic classifiers (such as neural networks with a softmax output layer) and captures more nuanced score variations for individual samples. The core content is summarized below.

### 1. **Background and Motivation**

- **Rashomon Effect**: Proposed by Breiman (2001), it describes the phenomenon that many different predictive models achieve similar training or test loss.
- **Predictive Multiplicity**: Occurs when competing models in the Rashomon set assign conflicting predictions to individual samples. This may lead to unjustified decisions in critical applications.

### 2. **Limitations of Existing Metrics**

- Existing metrics (such as ambiguity and discrepancy) are based mainly on thresholded predicted classes and may mask the actual diversity of predictions, especially for probabilistic classifiers.

### 3. **Definition and Characteristics of Rashomon Capacity**

- **Definition**: Rashomon Capacity uses KL divergence to quantify how much the output scores of the models in the Rashomon set vary for a given input sample.
- **Formula Representation**:

\[ m_C(x_i) = 2^{C(M_\epsilon(x_i))}, \quad C(M_\epsilon(x_i)) = \sup_{P_M} \inf_{q \in \Delta^c} \mathbb{E}_{h \sim P_M} \left[ D_{\text{KL}}(h(x_i) \parallel q) \right] \]

  where \( M_\epsilon(x_i) \) is the set of output scores of all models in the Rashomon set for sample \( x_i \), \( P_M \) is a probability distribution over the models in the Rashomon set, \( q \) ranges over \( \Delta^c \), the probability simplex over the \( c \) classes, and \( D_{\text{KL}} \) is the KL divergence.

### 4. **Computational Challenges and Solutions**

- **Computational Challenges**: Computing the Rashomon set exactly is usually infeasible, especially for complex hypothesis spaces (such as neural networks).
- **Solutions**: Perturb the model weights to find a subset of the Rashomon set that captures most of the score variation. Specific methods include:
  - Use the Adversarial Weight Perturbation (AWP) technique to explore the Rashomon set.
  - Use the Blahut-Arimoto algorithm to calculate Rashomon Capacity.

### 5. **Practical Applications and Significance**

- Rashomon Capacity can help data scientists measure and report predictive multiplicity before model deployment, supporting a more transparent and fair decision-making process.
- Identifying and disclosing predictive multiplicity can reduce unjustified and arbitrary decisions in critical areas (such as medicine, education, lending).

In conclusion, this paper introduces Rashomon Capacity to address the shortcomings of existing metrics in capturing the predictive multiplicity of probabilistic classifiers, with the goal of improving the reliability and fairness of models in practical applications.
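
To make the Blahut-Arimoto step in the summary concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes a finite sample of models from the Rashomon set has already been collected (for example by weight perturbation), and that their scores for one sample \( x_i \) are stacked into a matrix with one row per model. The function name `rashomon_capacity` and its parameters `max_iter` and `tol` are hypothetical. The rows are treated as a discrete channel from models to classes, and the standard Blahut-Arimoto iteration estimates the channel capacity \( C \) in bits, from which \( 2^C \) is returned.

```python
import numpy as np


def rashomon_capacity(scores, max_iter=1000, tol=1e-9):
    """Estimate 2**C for one sample from sampled model scores.

    scores: array-like of shape (n_models, n_classes); each row is the
            probability vector h(x_i) produced by one sampled model.
    Hypothetical helper: name and parameters are illustrative only.
    """
    W = np.asarray(scores, dtype=float)
    n_models = W.shape[0]
    p = np.full(n_models, 1.0 / n_models)  # initial weights over models (P_M)
    lower = 0.0
    for _ in range(max_iter):
        q = p @ W  # mixture of score vectors under the current weights
        # KL divergence (in bits) of each model's scores from the mixture q;
        # terms with zero probability contribute zero.
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(W > 0, np.log2(W / q), 0.0)
        d = np.sum(W * log_ratio, axis=1)
        # Standard Blahut-Arimoto bounds on the capacity C.
        lower = np.log2(np.sum(p * 2.0 ** d))
        upper = np.max(d)
        if upper - lower < tol:
            break
        # Reweight models toward those that diverge most from the mixture.
        p = p * 2.0 ** d
        p /= p.sum()
    return float(2.0 ** lower)


# All sampled models agree: no multiplicity, value close to 1.
agree = rashomon_capacity([[0.7, 0.3], [0.7, 0.3]])
# Two models make opposite confident predictions: value close to 2.
conflict = rashomon_capacity([[1.0, 0.0], [0.0, 1.0]])
print(agree, conflict)
```

Because the capacity of a channel with \( c \) outputs lies between 0 and \( \log_2 c \) bits, the returned value lies between 1 (all sampled models produce the same scores) and \( c \) (the sampled models make confident, mutually exclusive predictions). The estimate only reflects the models that were sampled, so it is a lower bound on the value over the full Rashomon set.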