Abstract: Vision Language Models (VLMs) such as CLIP are powerful models; however, they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image and text-to-video retrieval, reverse search, or classification. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle human appearance from the context of an image, i.e., what makes a doctor is not correlated with the person's visual appearance, like skin color or body type, but with the context, such as the background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
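
The abstract names MaxSkew, MinSkew, and NDKL without defining them. As a rough sketch, the code below follows the definitions commonly used in the fairness-in-ranking literature; it is an assumption that the paper uses exactly these forms. The function names, the attribute labels, and the `desired` target distribution are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def skew_at_k(ranked_attrs, desired, k):
    """Skew@k per attribute: ln(observed share in the top-k / desired share)."""
    top = ranked_attrs[:k]
    counts = Counter(top)
    skews = {}
    for attr, target in desired.items():
        observed = counts.get(attr, 0) / len(top)
        # An attribute missing from the top-k has skew of minus infinity.
        skews[attr] = math.log(observed / target) if observed > 0 else float("-inf")
    return skews


def max_min_skew(ranked_attrs, desired, k):
    """MaxSkew@k and MinSkew@k: the largest and smallest per-attribute skew."""
    skews = skew_at_k(ranked_attrs, desired, k)
    return max(skews.values()), min(skews.values())


def ndkl(ranked_attrs, desired):
    """Normalized discounted KL divergence over every prefix of the ranking."""
    total, norm = 0.0, 0.0
    counts = Counter()
    for i, attr in enumerate(ranked_attrs, start=1):
        counts[attr] += 1
        weight = 1.0 / math.log2(i + 1)  # earlier ranks count more
        # KL(prefix distribution || desired); zero-count attributes contribute 0.
        kl = sum((c / i) * math.log((c / i) / desired[a]) for a, c in counts.items())
        total += weight * kl
        norm += weight
    return total / norm


# Hypothetical retrieval result: a group label for each returned image, in rank order.
ranking = ["a", "a", "a", "b"]
target = {"a": 0.5, "b": 0.5}
print(max_min_skew(ranking, target, k=4))
print(ndkl(ranking, target))
```

Under these definitions, a skew of 0 for every attribute and an NDKL near 0 indicate that the retrieved set matches the desired distribution, so lower magnitudes are better.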

What problem does this paper attempt to address?

This paper addresses bias in large-scale vision-language models such as CLIP when they handle queries about people. Although these models perform well after pre-training on large amounts of image-text data, they also absorb harmful content and biases from that data, which leads to biased retrieval results, misclassification, and other undesirable behavior. The paper proposes a framework that synthesizes diverse counterfactual images to build a balanced and diverse dataset for fine-tuning CLIP. The framework uses a text-to-image model to generate base images, then uses off-the-shelf segmentation and inpainting models to place people with different appearances in context, so that the model learns to separate a person's appearance from contextual information such as the background, clothing, and objects held. The fine-tuned CLIP model (referred to as CFα) improves key fairness metrics (MaxSkew, MinSkew, and NDKL) in image retrieval tasks by 40-66% while maintaining similar performance on downstream tasks. The authors also propose a simple plug-and-play way to control the trade-off between accuracy and fairness by interpolating model weights. The contributions of this paper include:

1. A method for creating a diverse and balanced image dataset starting from core visual concepts.
2. Fine-tuning CLIP by combining linear weight interpolation with additional self-supervised loss terms.
3. A demonstration that fine-tuning on the proposed dataset mitigates race- and gender-based bias in CLIP, and that the fine-tuned model remains compatible with the original pre-trained model, so users can flexibly adjust the trade-off between accuracy and fairness.

Experiments show that this approach significantly improves the fairness of CLIP on real images, and that weight-space interpolation controls the trade-off between accuracy and fairness while largely preserving performance on downstream tasks.
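
The accuracy-versus-fairness control described above can be illustrated with a minimal sketch of linear weight interpolation between the original and fine-tuned checkpoints. This is not the authors' implementation: the function name, the parameter names, and the convention that `alpha = 0` returns the original model while `alpha = 1` returns the fully fine-tuned one are all assumptions made for illustration. The values can be plain floats, NumPy arrays, or tensors, as long as they support scalar multiplication and addition.

```python
def interpolate_weights(original, finetuned, alpha):
    """Return (1 - alpha) * original + alpha * finetuned, parameter by parameter.

    `original` and `finetuned` map parameter names to values of the same shape.
    Assumed convention: alpha = 0 keeps the original model, alpha = 1 the fine-tuned one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if original.keys() != finetuned.keys():
        raise KeyError("both checkpoints must contain the same parameter names")
    return {
        name: (1.0 - alpha) * original[name] + alpha * finetuned[name]
        for name in original
    }


# Toy checkpoints with scalar "weights" (hypothetical values).
clip_original = {"w": 1.0, "b": 0.0}
clip_finetuned = {"w": 3.0, "b": 2.0}
print(interpolate_weights(clip_original, clip_finetuned, alpha=0.5))
# → {'w': 2.0, 'b': 1.0}
```

Because both checkpoints share one architecture, a single scalar selects a point between them without retraining, which is what makes the trade-off adjustable at deployment time.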

They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

VisionCLIP: An Med-AIGC based Ethical Language-Image Foundation Model for Generalizable Retina Image Analysis

Joint Vision-Language Social Bias Removal for CLIP

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

FairCLIP: Harnessing Fairness in Vision-Language Learning

FairCLIP: Social Bias Elimination based on Attribute Prototype Learning and Representation Neutralization

Finetuning CLIP to Reason about Pairwise Differences

FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs

Dataset Scale and Societal Consistency Mediate Facial Impression Bias in Vision-Language AI

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Uncovering Bias in Large Vision-Language Models with Counterfactuals

Evaluating the Fairness of Discriminative Foundation Models in Computer Vision

Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias