Abstract: Facial expression recognition (FER) is an important research topic in emotional artificial intelligence. In recent decades, researchers have made remarkable progress. However, current FER paradigms face challenges in generalization, lack semantic information aligned with natural language, and struggle to process both images and videos within a unified framework, making their application in multimodal emotion understanding and human-computer interaction difficult. Multimodal Large Language Models (MLLMs) have recently achieved success, offering advantages in addressing these issues and potentially overcoming the limitations of current FER paradigms. However, directly applying pre-trained MLLMs to FER still faces several challenges. Our zero-shot evaluations of existing open-source MLLMs on FER indicate a significant performance gap compared to GPT-4V and current supervised state-of-the-art (SOTA) methods. In this paper, we aim to enhance MLLMs' capabilities in understanding facial expressions. We first generate instruction data for five FER datasets with Gemini. We then propose a novel MLLM, named EMO-LLaMA, which incorporates facial priors from a pretrained facial analysis network to enhance human facial information. Specifically, we design a Face Info Mining module to extract both global and local facial information. Additionally, we utilize a handcrafted prompt to introduce age-gender-race attributes, considering the emotional differences across different human groups. Extensive experiments show that EMO-LLaMA achieves SOTA-comparable or competitive results across both static and dynamic FER datasets. The instruction dataset and code are available at https://github.com/xxtars/EMO-LLaMA.

General Facial Representation Learning in a Visual-Linguistic Manner

Toward High Quality Facial Representation Learning

DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition

Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer

Large Pose Face Recognition via Facial Representation Learning

Self-Supervised Facial Representation Learning with Facial Region Awareness

Efficient Facial Expression Recognition with Representation Reinforcement Network and Transfer Self-Training for Human–Machine Interaction

CRFAST: Clip-Based Reference-Guided Facial Image Semantic Transfer

Multimodal Image-Text Representation Learning for Sketch-Less Facial Image Retrieval

A Generative Framework for Self-Supervised Facial Representation Learning

Facial Expression Recognition Based on Zero-Addition Pretext Training and Feature Conjunction-Selection Network in Human–Robot Interaction

Generalizable Facial Expression Recognition

A Generalist FaceX via Learning Unified Facial Representation

Facial Expression Recognition With Visual Transformers and Attentional Selective Fusion

The Devil is in the Face: Exploiting Harmonious Representations for Facial Expression Recognition

Deep Margin-Sensitive Representation Learning for Cross-Domain Facial Expression Recognition

Unsupervised Facial Expression Representation Learning with Contrastive Local Warping

Facial expression recognition through multi-level features extraction and fusion

EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition