Abstract: Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations of black box classifiers. Our methodology is based on counterfactual images generated with respect to one classifier being input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. This is validated by observing intuitive trends in face-attribute and waterbird classifiers, as well as by fabricating spurious correlations and detecting their presence, both visually and quantitatively. Furthermore, utilizing the CF alignment method, we demonstrate that we can evaluate robust optimization methods (GroupDRO, JTT, and FLAC) by detecting a reduction in spurious correlations.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of poor generalization performance in models due to spurious correlations. Specifically, the authors propose a counterfactual alignment method to detect and quantify spurious correlations in black-box classifiers. By generating counterfactual images and inputting them into other classifiers, the changes in these classifiers' outputs can be observed, quantified, and used to identify spurious correlations present in specific instances.

### Main Contributions

1. **Proposed Counterfactual Alignment Method**: This method quantifies feature relationships between classifiers using a relative change metric. It allows for both overall quantitative evaluation of the model and specific example queries to identify predictions made using spurious correlations.
2. **Validated Method Effectiveness**: Demonstrated the ability to detect spurious correlations in existing facial attribute and waterbird classifiers. The method's effectiveness was validated by observing intuitive trends in facial attribute classifiers and inducing spurious correlations to detect their presence.
3. **Evaluated Robust Optimization Methods**: Showed that the counterfactual alignment method can evaluate robust optimization methods (such as GroupDRO, JTT, and FLAC) by detecting the reduction of spurious correlations. Experimental results indicate that reducing spurious correlations improves the model's generalization performance.

### Experimental Setup

- **Datasets**: CelebAHQ dataset (containing over 200k celebrity images, each with 40 facial attribute labels) and CelebA dataset (lower resolution, 178x178).
- **Pre-trained Classifiers**: Obtained from the work of Vandenhende et al. (2020), these classifiers were trained on the CelebA dataset to predict 40 different facial attributes for 224x224 images.
- **Autoencoder**: Used the VQ-GAN autoencoder from Esser et al. (2021), trained on the FacesHQ dataset, which combines CelebA HQ and Flickr-Faces-HQ (FFHQ) datasets, with a resolution of 256x256.

### Experimental Results

- **Overall Statistical Analysis**: By generating counterfactual images and calculating relative changes, the relationships between different classifiers were demonstrated. Results showed that classifiers might not use correlated features present in the training data, and vice versa.
- **Specific Example Analysis**: By selecting specific classifiers and images, the predictions of classifiers on specific images were explained, revealing potential spurious correlations.
- **Comparison with Saliency Maps**: Showed that saliency maps might not provide meaningful feature localization in some cases, whereas the counterfactual alignment method offers a deeper understanding of the features used by the model.
- **Validation of Induced Spurious Correlations**: By constructing a classifier with known spurious correlations, it was validated that the counterfactual alignment method could detect such biases.
- **Evaluation of Robust Optimization Methods**: Evaluated methods like GroupDRO, JTT, and FLAC on the Waterbirds and CelebA datasets, showing that these methods effectively reduce spurious correlations and improve model classification performance.

### Limitations

- **Challenges in Counterfactual Generation**: Generating high-quality counterfactual images is challenging, especially in terms of feature modulation.
- **Model Dependency**: The method's effectiveness depends on high-quality autoencoders and classifiers, which need to be trained in the same training domain.

Overall, the paper proposes a counterfactual alignment method to detect and quantify spurious correlations in models, providing a new tool to improve model generalization performance and fairness.
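The core idea summarized above (generate a counterfactual against one classifier, then check whether other classifiers also respond) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the classifiers are hypothetical logistic models on a three-dimensional feature vector instead of image classifiers, the counterfactual is produced by normalized gradient steps directly in feature space instead of an autoencoder latent space, and `relative_change` is taken to be the ratio of the downstream classifier's output change to the base classifier's output change, which may differ from the exact metric used in the paper.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def make_linear_classifier(w):
    """Hypothetical toy classifier: returns (predict, grad) for sigmoid(x @ w)."""
    w = np.asarray(w, dtype=float)

    def predict(x):
        return sigmoid(x @ w)

    def grad(x):
        p = predict(x)
        return p * (1.0 - p) * w  # derivative of sigmoid(x @ w) w.r.t. x

    return predict, grad


def counterfactual(x, predict, grad, target_drop=0.3, step=0.1, max_steps=1000):
    """Step against the base classifier's gradient until its output
    has dropped by at least target_drop (or max_steps is reached)."""
    start = predict(x)
    x_cf = x.copy()
    for _ in range(max_steps):
        if start - predict(x_cf) >= target_drop:
            break
        g = grad(x_cf)
        x_cf = x_cf - step * g / (np.linalg.norm(g) + 1e-12)
    return x_cf


def relative_change(x, x_cf, base_predict, other_predict):
    """Assumed metric: downstream output change divided by base output change."""
    base_delta = base_predict(x_cf) - base_predict(x)
    other_delta = other_predict(x_cf) - other_predict(x)
    return other_delta / base_delta


# Features: [0] the true attribute, [1] a spurious attribute, [2] unrelated.
biased_base, biased_grad = make_linear_classifier([2.0, 1.5, 0.0])  # leans on feature 1
clean_base, clean_grad = make_linear_classifier([2.0, 0.0, 0.0])    # ignores feature 1
spurious_clf, _ = make_linear_classifier([0.0, 2.0, 0.0])           # detects feature 1
unrelated_clf, _ = make_linear_classifier([0.0, 0.0, 2.0])          # detects feature 2

x = np.array([1.0, 0.5, -0.2])

x_cf_biased = counterfactual(x, biased_base, biased_grad)
x_cf_clean = counterfactual(x, clean_base, clean_grad)

rc_biased_spurious = relative_change(x, x_cf_biased, biased_base, spurious_clf)
rc_biased_unrelated = relative_change(x, x_cf_biased, biased_base, unrelated_clf)
rc_clean_spurious = relative_change(x, x_cf_clean, clean_base, spurious_clf)

print("biased base vs spurious classifier: ", rc_biased_spurious)
print("biased base vs unrelated classifier:", rc_biased_unrelated)
print("clean base vs spurious classifier:  ", rc_clean_spurious)
```

In this toy setting, a counterfactual for the biased base classifier also moves the classifier for the spurious attribute, giving a nonzero relative change, while the classifier for the unrelated feature and the counterfactual for the clean base classifier show none. This mirrors, in miniature, the summary's validation step of fabricating a known spurious correlation and checking that it is detected.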

Identifying Spurious Correlations using Counterfactual Alignment
