FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks

Peiran Wu, Che Liu, Canyu Chen, Jun Li, Cosmin I. Bercea, Rossella Arcucci

2024-10-02

Abstract: Advancements in Multimodal Large Language Models (MLLMs) have significantly improved performance on medical tasks such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This oversight is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLM performance across diverse demographic attributes. FMBench has the following key features: (1) It includes four demographic attributes: race, ethnicity, language, and gender, across two tasks, VQA and RG, under zero-shot settings. (2) Our VQA task is free-form, enhancing real-world applicability and mitigating the biases associated with predefined choices. (3) We utilize both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the insufficient evaluation of fairness in multimodal large language models (MLLMs) on medical tasks. Specifically:

1. **Insufficient fairness evaluation**: Although MLLM performance on medical tasks such as visual question answering (VQA) and report generation (RG) has improved significantly, the fairness of these models across demographic groups has not been fully explored. This matters especially in medicine, where unfair predictions may lead to harmful consequences.
2. **Lack of diverse datasets**: Existing medical multimodal datasets usually lack demographic diversity, which complicates the evaluation of model fairness. Moreover, many existing VQA datasets rely on closed-ended answers rather than open-ended, free-form answers, limiting their applicability in real clinical scenarios.
3. **Limitations of existing benchmarks**: Currently, there are no publicly available benchmarks that comprehensively evaluate fairness in medical multimodal tasks, especially across multiple demographic attributes.

To solve these problems, the authors propose FMBench, a benchmark specifically designed to evaluate the fairness of MLLMs on medical multimodal tasks. The main features of FMBench include:

- **Covering four demographic attributes**: race, ethnicity, language, and gender.
- **Including two tasks**: VQA and RG, both carried out in a zero-shot setting.
- **Introducing a new evaluation metric**: Fairness-Aware Performance (FAP), which evaluates how fairly MLLMs perform across demographic groups.

Through FMBench, the authors hope to help the research community improve model evaluation methods and promote future progress in this area.
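To make the idea of group-wise fairness evaluation concrete, here is a minimal Python sketch. It does not implement FAP, whose definition is not given on this page. It only averages a per-sample score within each demographic group and reports the largest gap between groups. The function names, the record layout, and the toy scores are all assumptions made for illustration, not data or code from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_group_scores(records, attribute):
    """Average the per-sample score within each value of one demographic attribute."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[attribute]].append(rec["score"])
    return {group: mean(scores) for group, scores in buckets.items()}


def max_gap(group_scores):
    """Largest difference between any two group averages (0 means equal averages)."""
    values = list(group_scores.values())
    return max(values) - min(values)


# Toy records (hypothetical): each holds demographic attributes and a score in [0, 1]
# that some lexical or LLM-based metric assigned to the model's answer.
records = [
    {"gender": "female", "language": "English", "score": 0.80},
    {"gender": "female", "language": "Spanish", "score": 0.60},
    {"gender": "male", "language": "English", "score": 0.90},
    {"gender": "male", "language": "Spanish", "score": 0.70},
]

gender_scores = per_group_scores(records, "gender")
language_scores = per_group_scores(records, "language")

print("gender:", gender_scores, "gap:", round(max_gap(gender_scores), 3))
print("language:", language_scores, "gap:", round(max_gap(language_scores), 3))
```

A max-min gap is only one simple way to summarize disparity. A benchmark metric such as FAP may weigh overall performance and group differences differently.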
