Face-MLLM: A Large Face Perception Model

Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, Shiguang Shan
2024-10-28
Abstract: Although multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, their ability to perceive and understand human faces is rarely explored. In this work, we comprehensively evaluate existing MLLMs on face perception tasks. The quantitative results reveal that existing MLLMs struggle to handle these tasks. The primary reason is the lack of image-text datasets that contain fine-grained descriptions of human faces. To tackle this problem, we design a practical pipeline for constructing datasets, upon which we further build a novel multimodal large face perception model, namely Face-MLLM. Specifically, we re-annotate the LAION-Face dataset with more detailed face captions and facial attribute labels. In addition, we re-formulate traditional face datasets in a question-answer style, which suits MLLMs. Together with these enriched datasets, we develop a novel three-stage MLLM training method. In the first two stages, our model learns visual-text alignment and basic visual question answering capability, respectively. In the third stage, our model learns to handle multiple specialized face perception tasks. Experimental results show that our model surpasses previous MLLMs on five famous face perception tasks. Moreover, on our newly introduced zero-shot facial attribute analysis task, Face-MLLM also shows superior performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the weak perception and understanding of human faces in multimodal large language models (MLLMs). Although existing MLLMs achieve strong results on a wide range of vision-language tasks, they perform poorly on fine-grained face perception tasks. The authors attribute this mainly to the lack of image-text datasets containing detailed face descriptions. They first evaluate existing MLLMs on face perception tasks and find that these models struggle. They then design a practical dataset construction pipeline that re-annotates the LAION-Face dataset with more detailed face captions and facial attribute labels, and reformats traditional face datasets into a question-answer format suitable for MLLMs. On this data they build Face-MLLM, a multimodal large face perception model, using a three-stage training method: the first two stages teach visual-text alignment and basic visual question answering, and the third stage teaches multiple specialized face perception tasks.

The main contributions of the paper are:

1. **Comprehensive Evaluation**: A thorough evaluation of existing MLLMs on face perception tasks, revealing the limitations of current general-purpose models in this area.
2. **Dataset Construction**: A low-cost data construction pipeline that addresses the scarcity of suitable training data by re-annotating the LAION-Face dataset and reformatting traditional face datasets into an MLLM-compatible question-answer format.
3. **Three-Stage Training Method**: A three-stage training method, built on these enriched datasets, that improves the performance of Face-MLLM on both traditional and zero-shot face perception tasks.
4. **New Benchmark**: A new benchmark for zero-shot facial attribute analysis, on which Face-MLLM outperforms existing state-of-the-art MLLMs.

According to the reported experiments, Face-MLLM surpasses previous MLLMs on five face perception tasks and also generalizes well to the newly introduced zero-shot facial attribute analysis task.
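The reformulation of traditional face datasets into a question-answer style can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `QAPair` type, the question templates, the task names, and the helper functions `labels_to_qa` and `attributes_to_qa` are all hypothetical, chosen only to show how per-image labels might be turned into text pairs that an MLLM can train on.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    """One training sample: an image plus a question and its text answer."""
    image_path: str
    question: str
    answer: str


# Hypothetical question templates, one per face perception task.
# The actual wording used by the paper is not given in the summary above.
TEMPLATES = {
    "age": "How old is the person in the image?",
    "expression": "What is the facial expression of the person in the image?",
}


def labels_to_qa(image_path, labels):
    """Turn task labels (e.g. {"age": 27}) into question-answer pairs.

    Labels for tasks without a template are skipped.
    """
    pairs = []
    for task, value in labels.items():
        question = TEMPLATES.get(task)
        if question is None:
            continue
        pairs.append(QAPair(image_path, question, str(value)))
    return pairs


def attributes_to_qa(image_path, attributes):
    """Turn binary attribute labels into yes/no questions (assumed format)."""
    return [
        QAPair(
            image_path,
            f"Does the person in the image have the attribute '{name}'?",
            "Yes" if present else "No",
        )
        for name, present in attributes.items()
    ]


if __name__ == "__main__":
    samples = labels_to_qa("img_001.jpg", {"age": 27, "expression": "happy"})
    samples += attributes_to_qa("img_001.jpg", {"eyeglasses": True})
    for sample in samples:
        print(sample.question, "->", sample.answer)
```

Keeping every task in one shared `QAPair` shape is the point of the sketch: once regression, classification, and attribute labels are all expressed as text, a single model can be trained on them jointly, which is what the third training stage relies on.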