Abstract: Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-\emph{VL} benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first use text-to-image diffusion models to generate occupation images and their gender counterfactuals. We then generate the corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals that expose biases in LVLMs, and it applies in both multimodal and unimodal contexts by modifying gender attributes in specific modalities. Overall, our GenderBias-\emph{VL} benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases exhibited by these models.
\footnote{The dataset and code are available at the \href{https://genderbiasvl.github.io/}{website}.}

GenderBias-\emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing