EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, Heikki Kälviäinen

2024-08-21

Abstract: Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in the field of Facial Expression Recognition (FER):

1. **Insufficient Generalization Ability**: Existing FER methods generalize poorly across datasets and modalities.
2. **Lack of a Unified Framework**: Current methods typically handle static images and dynamic videos separately, with no single framework that covers both.
3. **Missing Semantic Information**: Existing methods focus mainly on classification and lack semantic information aligned with natural language, which limits their use in multimodal emotion understanding and human-computer interaction.
4. **Unconsidered Cross-Group Differences**: Some studies indicate that people from different racial and cultural backgrounds express emotions differently, but existing methods have not sufficiently accounted for this.

To tackle these issues, the paper proposes EMO-LLaMA, a Multimodal Large Language Model (MLLM) enhanced with facial priors from a pretrained facial analysis network. Specifically, the authors generate instruction data for five FER datasets with Gemini, design a Face Info Mining module to extract both global and local facial information, and add a handcrafted prompt that carries age-gender-race attributes. According to the abstract, EMO-LLaMA achieves results comparable to or competitive with supervised state-of-the-art methods on both static and dynamic FER datasets.
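To make the idea of a handcrafted attribute prompt inside an instruction-tuning sample more concrete, here is a minimal Python sketch. It is not the authors' data format: the label list, the prompt wording, the record fields (`media`, `instruction`, `response`), and the function names are all assumptions chosen for illustration, and the attribute values would in practice come from a facial analysis network rather than being typed by hand.

```python
import json

# Hypothetical label set; the actual label sets of the five FER datasets
# are not given in this text.
EMOTIONS = ["happy", "sad", "angry", "surprise", "fear", "disgust", "neutral"]


def build_attribute_prompt(age: str, gender: str, race: str) -> str:
    """Compose a handcrafted attribute sentence (the wording is an assumption)."""
    return f"Estimated attributes: age={age}, gender={gender}, race={race}."


def build_instruction_sample(media_path: str, label: str,
                             age: str, gender: str, race: str) -> dict:
    """Return one instruction-tuning record as a plain dict (hypothetical schema)."""
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {label}")
    instruction = (
        build_attribute_prompt(age, gender, race)
        + " What facial expression does this person show? "
        + "Answer with one of: " + ", ".join(EMOTIONS) + "."
    )
    return {"media": media_path, "instruction": instruction, "response": label}


# Placeholder values only; no real dataset entry is reproduced here.
sample = build_instruction_sample(
    "face_0001.jpg", "happy", age="20-29", gender="female", race="unknown"
)
print(json.dumps(sample, indent=2))
```

Keeping the attribute sentence in its own helper means the same record builder could serve both image and video samples, with only the `media` path changing; whether the released code is organized this way is not stated in the source.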

DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition

Cgan Based Facial Expression Recognition for Human-Robot Interaction

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Combining 2D Gabor and Local Binary Pattern for Facial Expression Recognition Using Extreme Learning Machine

Facial Affective Behavior Analysis with Instruction Tuning

Efficient Facial Expression Recognition with Representation Reinforcement Network and Transfer Self-Training for Human–Machine Interaction

Understanding Naturalistic Facial Expressions with Deep Learning and Multimodal Large Language Models

EmoLLMs: A Series of Emotional Large Language Models and Annotation Tools for Comprehensive Affective Analysis

ExpLLM: Towards Chain of Thought for Facial Expression Recognition

EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Multi-Attention Module for Dynamic Facial Emotion Recognition

Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation

MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues

Semantic-Rich Facial Emotional Expression Recognition

EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Exploring Large-scale Unlabeled Faces to Enhance Facial Expression Recognition

A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition

Exploring Facial Expression Recognition through Semi-Supervised Pretraining and Temporal Modeling

Adaptively Learning Facial Expression Representation via C-F Labels and Distillation
