Abstract: Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces is rarely explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate the LAION-Face dataset with more detailed face captions and facial attribute labels. Besides, we re-formulate traditional face datasets in a question-answer style, which suits MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five well-known face perception tasks. Besides, on our newly introduced zero-shot facial attribute analysis task, our Face-MLLM also presents superior performance.

What problem does this paper attempt to address?

The paper addresses the limited ability of Multimodal Large Language Models (MLLMs) to perceive and understand human faces. Although existing MLLMs have achieved strong results on a wide range of vision-language tasks, they perform poorly on fine-grained face perception tasks. The main reason is the lack of image-text datasets containing detailed face descriptions.

To solve this problem, the authors first comprehensively evaluated existing MLLMs on face perception tasks and found that these models struggle with them. They then designed a practical dataset construction pipeline: they re-annotated the LAION-Face dataset with more detailed captions and facial attribute labels, and reformatted traditional face perception datasets into a question-and-answer format suitable for MLLMs. On top of these datasets they built a new multimodal large face perception model, Face-MLLM, and developed a three-stage training method aimed at improving performance on both traditional and zero-shot face perception tasks.

Specifically, the main contributions of the paper include:

1. **Comprehensive Evaluation**: A thorough evaluation of existing MLLMs on face perception tasks, revealing the limitations of current general-purpose models in this field.
2. **Dataset Construction**: A low-cost data construction pipeline that overcomes the scarcity of suitable training data by re-annotating the LAION-Face dataset and reformatting traditional face datasets into MLLM-compatible formats.
3. **Three-Stage Training Method**: A three-stage training method built on these enriched datasets. The first two stages teach visual-text alignment and basic visual question answering, and the third stage teaches multiple specialized face perception tasks.
4. **New Benchmark**: A new benchmark for zero-shot facial attribute analysis, on which Face-MLLM outperforms existing state-of-the-art MLLMs.

With these methods, Face-MLLM surpasses previous MLLMs on five face perception tasks and also generalizes well to zero-shot facial attribute analysis.
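The reformatting step in the dataset construction pipeline can be illustrated with a minimal sketch. It assumes a traditional face dataset that stores binary attribute labels per image. The function name `attributes_to_qa`, the question template, the record fields, and the sample path are all hypothetical and are not taken from the paper, whose actual prompts and data format are not given here.

```python
from typing import Dict, List


def attributes_to_qa(image_path: str, attributes: Dict[str, bool]) -> List[dict]:
    """Turn binary attribute labels for one image into question-answer records.

    The question wording below is an illustrative assumption, not the
    template used by the Face-MLLM authors.
    """
    records = []
    # Sort attribute names so the output order is deterministic.
    for name, present in sorted(attributes.items()):
        readable = name.replace("_", " ").lower()
        records.append({
            "image": image_path,
            "question": (
                f"Does the person in the image have the attribute "
                f"'{readable}'? Answer yes or no."
            ),
            "answer": "yes" if present else "no",
        })
    return records


# Hypothetical sample: one image with two attribute labels.
sample = attributes_to_qa(
    "faces/000001.jpg",
    {"Smiling": True, "Eyeglasses": False},
)
for record in sample:
    print(record["question"], "->", record["answer"])
```

Each classification label becomes one instruction-style record, so an MLLM can be trained on it with the same text-generation objective it uses for other visual question answering data.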

Face-MLLM: A Large Face Perception Model

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

InfMLLM: A Unified Framework for Visual-Language Tasks

Assessment of Multimodal Large Language Models in Alignment with Human Values

Evaluating and Advancing Multimodal Large Language Models in Ability Lens

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

A Survey of Multimodal Large Language Model from A Data-centric Perspective

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Do Multimodal Large Language Models See Like Humans?

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

A Survey on Evaluation of Multimodal Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks