Abstract: Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations of black box classifiers. Our methodology is based on counterfactual images generated with respect to one classifier being input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. This is validated by observing intuitive trends in face-attribute and waterbird classifiers, as well as by fabricating spurious correlations and detecting their presence, both visually and quantitatively. Furthermore, utilizing the CF alignment method, we demonstrate that we can evaluate robust optimization methods (GroupDRO, JTT, and FLAC) by detecting a reduction in spurious correlations.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
The paper aims to address the issue of poor generalization performance in models due to spurious correlations. Specifically, the authors propose a counterfactual alignment method to detect and quantify spurious correlations in black-box classifiers. By generating counterfactual images and inputting them into other classifiers, the changes in these classifiers' outputs can be observed, quantified, and used to identify spurious correlations present in specific instances.
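The core loop described above (generate a counterfactual for one classifier, then feed it to the others) can be sketched on toy data. This is a minimal illustration, not the paper's implementation: the three-dimensional "latent" vector, the linear-sigmoid classifiers, the finite-difference gradient, and all names (`make_classifier`, `counterfactual`, `f_base`, `f_aligned`, `f_unrelated`) are assumptions made for the example, and the autoencoder's decoder is omitted entirely.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def make_classifier(weights, bias=0.0):
    """Toy stand-in for a classifier acting on a latent vector (assumption)."""
    def predict(z):
        return sigmoid(sum(w * v for w, v in zip(weights, z)) + bias)
    return predict

def numerical_gradient(f, z, eps=1e-5):
    """Central-difference gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += eps
        down = list(z)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def counterfactual(f_base, z, step=1.0, n_steps=20):
    """Shift z against the gradient of f_base so that its output decreases."""
    z_cf = list(z)
    for _ in range(n_steps):
        g = numerical_gradient(f_base, z_cf)
        z_cf = [v - step * gi for v, gi in zip(z_cf, g)]
    return z_cf

# Hypothetical latent features: index 0 drives the base classifier.
f_base = make_classifier([2.0, 0.0, 0.0])       # uses feature 0 only
f_aligned = make_classifier([1.5, 0.5, 0.0])    # shares feature 0 with the base
f_unrelated = make_classifier([0.0, 0.0, 2.0])  # uses a different feature

z = [1.0, 0.5, 0.3]
z_cf = counterfactual(f_base, z)

# Feed the same counterfactual to every classifier and compare outputs.
for name, f in [("base", f_base), ("aligned", f_aligned), ("unrelated", f_unrelated)]:
    print(f"{name:10s} before={f(z):.3f} after={f(z_cf):.3f}")
```

In this toy setup the classifier that shares a feature with the base classifier moves along with it, while the unrelated one stays put; a downstream classifier that moves when it should not is the kind of signal the method looks for.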
### Main Contributions
1. **Proposed Counterfactual Alignment Method**: This method quantifies feature relationships between classifiers using a relative change metric. It allows for both overall quantitative evaluation of the model and specific example queries to identify predictions made using spurious correlations.
2. **Validated Method Effectiveness**: Demonstrated the ability to detect spurious correlations in existing facial attribute and waterbird classifiers. The method was validated in two ways: by observing intuitive trends in facial attribute classifiers, and by deliberately inducing spurious correlations and then detecting their presence.
3. **Evaluated Robust Optimization Methods**: Showed that the counterfactual alignment method can evaluate robust optimization methods (such as GroupDRO, JTT, and FLAC) by detecting the reduction of spurious correlations. Experimental results indicate that reducing spurious correlations improves the model's generalization performance.
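The relative change metric mentioned in the first contribution is not spelled out in this summary. One plausible form, shown here purely as an assumption, normalizes the change in a downstream classifier's output by the change in the base classifier's output for the same counterfactual, and then averages over images. The function names and the example numbers are hypothetical.

```python
def relative_change(base_before, base_after, other_before, other_after,
                    min_base_change=1e-6):
    """Downstream output change per unit of base output change.

    Near 1: the downstream classifier moves with the base classifier.
    Near 0: it is unaffected. Negative: it moves in the opposite direction.
    """
    base_delta = base_after - base_before
    if abs(base_delta) < min_base_change:
        raise ValueError("counterfactual did not change the base classifier")
    return (other_after - other_before) / base_delta

def mean_relative_change(records):
    """Average the per-image metric over (base_before, base_after,
    other_before, other_after) tuples."""
    values = [relative_change(*r) for r in records]
    return sum(values) / len(values)

# Hypothetical classifier outputs for two images and their counterfactuals.
records = [
    (0.9, 0.1, 0.8, 0.2),  # downstream drops by 0.6 while base drops by 0.8
    (0.8, 0.2, 0.7, 0.4),  # downstream drops by 0.3 while base drops by 0.6
]
score = mean_relative_change(records)
print(f"mean relative change: {score:.3f}")
```

The per-image value supports the "specific example queries" use, and the mean supports the "overall quantitative evaluation" use; the guard on a near-zero base change avoids dividing by a counterfactual that failed to move the base classifier.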
### Experimental Setup
- **Datasets**: CelebA-HQ dataset (a high-resolution subset of CelebA) and CelebA dataset (over 200k celebrity images, each with 40 facial attribute labels; lower resolution, 178x178).
- **Pre-trained Classifiers**: Obtained from the work of Vandenhende et al. (2020), these classifiers were trained on the CelebA dataset to predict 40 different facial attributes for 224x224 images.
- **Autoencoder**: Used the VQ-GAN autoencoder from Esser et al. (2021), trained on the FacesHQ dataset, which combines the CelebA-HQ and Flickr-Faces-HQ (FFHQ) datasets, with a resolution of 256x256.
### Experimental Results
- **Overall Statistical Analysis**: By generating counterfactual images and calculating relative changes, the relationships between different classifiers were demonstrated. Results showed that classifiers do not necessarily rely on attribute correlations present in the training data, and conversely may rely on relationships that are absent from it.
- **Specific Example Analysis**: By selecting specific classifiers and images, the predictions of classifiers on specific images were explained, revealing potential spurious correlations.
- **Comparison with Saliency Maps**: Showed that saliency maps might not provide meaningful feature localization in some cases, whereas the counterfactual alignment method offers a deeper understanding of the features used by the model.
- **Validation of Induced Spurious Correlations**: By constructing a classifier with known spurious correlations, it was validated that the counterfactual alignment method could detect such biases.
- **Evaluation of Robust Optimization Methods**: Evaluated methods like GroupDRO, JTT, and FLAC on the Waterbirds and CelebA datasets, showing that these methods effectively reduce spurious correlations and improve model classification performance.
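The induced-correlation check can be illustrated with a toy construction. The sketch below is an assumption-laden stand-in, not the paper's experiment: it builds a "biased" classifier by blending a target signal with a spurious one, generates a counterfactual for it by finite-difference gradient steps in a two-dimensional latent space, and measures how much the spurious classifier's output moves per unit change in the biased classifier. The names `make_biased`, `spurious_alignment`, and the mixing weight `mix` are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical latent layout: z = [target_feature, spurious_feature]
def f_target(z):
    return sigmoid(3.0 * z[0])

def f_spurious(z):
    return sigmoid(3.0 * z[1])

def make_biased(mix):
    """Classifier whose output blends the target and spurious signals."""
    def f(z):
        return (1.0 - mix) * f_target(z) + mix * f_spurious(z)
    return f

def counterfactual(f, z, step=0.5, n_steps=40, eps=1e-5):
    """Lower f(z) by repeated finite-difference gradient steps."""
    z_cf = list(z)
    for _ in range(n_steps):
        grad = []
        for i in range(len(z_cf)):
            up = list(z_cf)
            up[i] += eps
            down = list(z_cf)
            down[i] -= eps
            grad.append((f(up) - f(down)) / (2 * eps))
        z_cf = [v - step * g for v, g in zip(z_cf, grad)]
    return z_cf

def spurious_alignment(f, z):
    """Change in the spurious classifier per unit change in f."""
    z_cf = counterfactual(f, z)
    return (f_spurious(z_cf) - f_spurious(z)) / (f(z_cf) - f(z))

z = [0.8, 0.8]
clean = spurious_alignment(make_biased(0.0), z)   # no induced correlation
biased = spurious_alignment(make_biased(0.5), z)  # half the signal is spurious
print(f"clean: {clean:.3f}  biased: {biased:.3f}")
```

The same before-and-after comparison is how a mitigation method could be scored in this framing: a model retrained with a robust optimization method should show a smaller alignment with the spurious classifier than the baseline model does.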
### Limitations
- **Challenges in Counterfactual Generation**: Generating high-quality counterfactual images is challenging, especially in terms of feature modulation.
- **Model Dependency**: The method's effectiveness depends on high-quality autoencoders and classifiers, which need to be trained on the same data domain.
Overall, the paper proposes a counterfactual alignment method to detect and quantify spurious correlations in models, providing a new tool to improve model generalization performance and fairness.