FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks

Peiran Wu, Che Liu, Canyu Chen, Jun Li, Cosmin I. Bercea, Rossella Arcucci
2024-10-02
Abstract: Advancements in Multimodal Large Language Models (MLLMs) have significantly improved medical task performance, such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This oversight is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLM performance across diverse demographic attributes. FMBench has the following key features: 1: It includes four demographic attributes: race, ethnicity, language, and gender, across two tasks, VQA and RG, under zero-shot settings. 2: Our VQA task is free-form, enhancing real-world applicability and mitigating the biases associated with predefined choices. 3: We utilize both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of fairness evaluation for multimodal large language models (MLLMs) on medical tasks. Specifically:

1. **Insufficient fairness evaluation**: Although MLLM performance on medical tasks such as visual question answering (VQA) and report generation (RG) has improved significantly, the fairness of these models across demographic groups has not been fully explored. This matters especially in medicine, where unfair predictions may lead to harmful consequences.
2. **Lack of diverse datasets**: Existing medical multimodal datasets usually lack demographic diversity, which complicates the evaluation of model fairness. Moreover, many existing VQA datasets rely on closed-ended answers rather than open-ended, free-form answers, limiting their applicability in real clinical scenarios.
3. **Limitations of existing benchmarks**: There is currently no publicly available benchmark that comprehensively evaluates fairness in medical multimodal tasks, especially across multiple demographic attributes.

To address these problems, the authors propose FMBench, a benchmark designed specifically to evaluate the fairness of MLLMs on medical multimodal tasks. Its main features are:

- **Four demographic attributes**: race, ethnicity, language, and gender.
- **Two tasks**: VQA and RG, both evaluated in a zero-shot setting.
- **A new evaluation metric**: Fairness-Aware Performance (FAP), which measures how fairly MLLMs perform across demographic groups.

Through FMBench, the authors hope to help the research community improve model evaluation methods and promote future progress in this area.