Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Chenyu You,Yifei Min,Weicheng Dai,Jasjeet S. Sekhon,Lawrence Staib,James S. Duncan
2024-03-12
Abstract:Fine-tuning pre-trained vision-language models, like CLIP, has yielded
success on diverse downstream tasks. However, several pain points persist for
this paradigm: (i) directly tuning entire pre-trained models becomes both
time-intensive and computationally costly. Additionally, these tuned models
tend to become highly specialized, limiting their practicality for real-world
deployment; (ii) recent studies indicate that pre-trained vision-language
classifiers may overly depend on spurious features -- patterns that correlate
with the target in training data, but are not related to the true labeling
function; and (iii) existing studies on mitigating the reliance on spurious
features, largely based on the assumption that we can identify such features,
does not provide definitive assurance for real-world applications. As a
piloting study, this work focuses on exploring mitigating the reliance on
spurious features for CLIP without using any group annotation. To this end, we
systematically study the existence of spurious correlation on CLIP and
CILP+ERM. We first, following recent work on Deep Feature Reweighting (DFR),
verify that last-layer retraining can greatly improve group robustness on
pretrained CLIP. In view of them, we advocate a lightweight representation
calibration method for fine-tuning CLIP, by first generating a calibration set
using the pretrained CLIP, and then calibrating representations of samples
within this set through contrastive learning, all without the need for group
labels. Extensive experiments and in-depth visualizations on several benchmarks
validate the effectiveness of our proposals, largely reducing reliance and
significantly boosting the model generalization.
Computer Vision and Pattern Recognition,Machine Learning