EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, Heikki Kälviäinen
2024-08-21
Abstract: Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address several key issues in Facial Expression Recognition (FER):

1. **Insufficient Generalization Ability**: Existing FER methods generalize poorly across datasets and modalities.
2. **Lack of a Unified Framework**: Current methods typically handle static images and dynamic videos separately, with no single framework that covers both.
3. **Missing Semantic Information**: Existing methods focus mainly on classification and lack semantic information aligned with natural language, which limits their use in multimodal emotion understanding and human-computer interaction.
4. **Unconsidered Cross-Group Differences**: Some studies indicate that people from different racial and cultural backgrounds express emotions differently, but existing methods do not sufficiently account for these differences.

To tackle these issues, the paper proposes EMO-LLaMA, a Multimodal Large Language Model (MLLM) enhanced with facial priors from a pretrained facial analysis network. Specifically, the authors generate instruction data for five FER datasets with Gemini, design a Face Info Mining module that extracts both global and local facial information, and use a handcrafted prompt to introduce age-gender-race attributes. With these components, EMO-LLaMA achieves results comparable to or competitive with state-of-the-art methods on both static and dynamic FER datasets.
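
As a rough illustration of the handcrafted attribute prompt and instruction data mentioned in the abstract, the Python sketch below assembles one instruction-tuning record. Every name in it (`FaceAttributes`, `build_prompt`, `build_record`), the prompt wording, the record layout, and the emotion label list are assumptions made for illustration; none of them is taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class FaceAttributes:
    """Hypothetical container for attributes predicted by a facial analysis network."""
    age: str
    gender: str
    race: str


# Illustrative label set (assumption); each FER dataset defines its own categories.
EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]


def build_prompt(attrs: FaceAttributes) -> str:
    """Prepend attribute context to a closed-set expression question."""
    context = (
        f"Attributes of the person: age: {attrs.age}, "
        f"gender: {attrs.gender}, race: {attrs.race}."
    )
    question = (
        "What is the facial expression of this person? "
        f"Choose one of: {', '.join(EMOTIONS)}."
    )
    return f"{context} {question}"


def build_record(image_path: str, attrs: FaceAttributes, label: str) -> dict:
    """Pair the instruction with its ground-truth label as one training sample."""
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {label}")
    return {
        "image": image_path,
        "instruction": build_prompt(attrs),
        "response": label,
    }
```

Calling `build_record("face_001.jpg", FaceAttributes("20-29", "female", "Asian"), "happy")` would yield a dictionary whose `instruction` field carries the attribute context followed by the question, and whose `response` field is the label. Keeping the attributes as plain text in the prompt, rather than as extra model inputs, is one simple way to expose them to a language model; whether EMO-LLaMA formats them this way is not stated in the abstract.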