ExpLLM: Towards Chain of Thought for Facial Expression Recognition

Xing Lan, Jian Xue, Ji Qi, Dongmei Jiang, Ke Lu, Tat-Seng Chua
2024-09-04
Abstract: Facial expression recognition (FER) is a critical task in multimedia with significant implications across various domains. Accurately recognizing facial expressions, however, requires analyzing their causes. Current approaches, such as those based on facial action units (AUs), typically provide AU names and intensities but lack insight into the interactions and relationships between AUs and the overall expression. In this paper, we propose a novel method called ExpLLM, which leverages large language models to generate an accurate chain of thought (CoT) for facial expression recognition. Specifically, we have designed the CoT mechanism from three key perspectives: key observations, overall emotional interpretation, and conclusion. The key observations describe each AU's name, intensity, and associated emotions. The overall emotional interpretation provides an analysis based on multiple AUs and their interactions, identifying the dominant emotions and their relationships. Finally, the conclusion presents the final expression label derived from the preceding analysis. We also introduce the Exp-CoT Engine, designed to construct this expression CoT and generate instruction-description data for training ExpLLM. Extensive experiments on the RAF-DB and AffectNet datasets demonstrate that ExpLLM outperforms current state-of-the-art FER methods. ExpLLM also surpasses the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions, where GPT-4o frequently fails.
Computer Vision and Pattern Recognition, Multimedia
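To make the three-part chain of thought described in the abstract concrete, the following minimal Python sketch models one expression CoT as a record and renders it as text. The class names, field names, the A to E intensity letters (borrowed from the FACS intensity convention), and the output template are all illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass


@dataclass
class AUObservation:
    """One 'key observation': an action unit, its intensity, and related emotions."""
    au_id: int
    name: str
    intensity: str  # assumed FACS-style grade: "A" (trace) to "E" (maximum)
    emotions: list[str]


@dataclass
class ExpressionCoT:
    """Hypothetical container for the three CoT parts: observations, interpretation, conclusion."""
    observations: list[AUObservation]
    interpretation: str
    conclusion: str

    def render(self) -> str:
        # Emit the three parts in order, one line per AU observation.
        lines = ["Key observations:"]
        for obs in self.observations:
            lines.append(
                f"- AU{obs.au_id} ({obs.name}), intensity {obs.intensity}: "
                f"associated with {', '.join(obs.emotions)}"
            )
        lines.append(f"Overall emotional interpretation: {self.interpretation}")
        lines.append(f"Conclusion: {self.conclusion}")
        return "\n".join(lines)


# Example: AU6 + AU12 co-occurring is the classic (Duchenne) smile pattern.
cot = ExpressionCoT(
    observations=[
        AUObservation(6, "Cheek Raiser", "C", ["happiness"]),
        AUObservation(12, "Lip Corner Puller", "D", ["happiness"]),
    ],
    interpretation="AU6 and AU12 co-occur, a combination typical of a genuine smile, so happiness dominates.",
    conclusion="happiness",
)
print(cot.render())
```

Keeping the conclusion as a separate field mirrors the abstract's point that the final label is derived from, and stated after, the preceding analysis.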
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses three key issues in facial expression recognition (FER):

1. **Lack of in-depth analysis of expression causes**: Existing FER methods typically report only the names and intensities of facial action units (AUs), without explaining how these AUs interact or how they relate to the overall expression.
2. **Insufficient transparency and interpretability**: Current methods rarely expose an intuitive, transparent reasoning process behind their predictions, which makes the results difficult to explain and verify.
3. **Difficulty with micro-expressions**: Existing methods perform poorly on subtle expressions such as micro-expressions, especially when the emotional expression is complex.

To address these issues, the paper proposes **ExpLLM** (Expression Large Language Model), which uses a large language model to generate a detailed chain of thought (CoT) that analyzes a facial expression step by step. ExpLLM structures this chain of thought around three key perspectives:

- **Key Observations**: the name, intensity, and associated emotions of each AU.
- **Overall Emotional Interpretation**: an analysis of multiple AUs and their interactions that identifies the dominant emotions and how they relate to one another.
- **Conclusion**: the final expression label derived from the preceding analysis.

The paper also introduces the **Exp-CoT Engine**, which constructs these expression chains of thought and generates instruction-description data pairs for training ExpLLM. With this approach, ExpLLM both improves FER accuracy and produces detailed, logically coherent descriptions of facial expressions.

### Main Contributions

1. **Innovative model**: ExpLLM offers an intuitive, step-by-step method for facial expression analysis, improving the transparency and interpretability of FER.
2. **Data construction engine**: The Exp-CoT Engine generates high-quality instruction-description data pairs so that the model learns to produce accurate expression chains of thought.
3. **Experimental validation**: Extensive experiments on the RAF-DB and AffectNet datasets show that ExpLLM outperforms state-of-the-art FER methods in accuracy and generates high-quality expression descriptions, surpassing the latest GPT-4o in expression CoT generation, particularly in recognizing micro-expressions.
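The summary does not specify how the Exp-CoT Engine builds its instruction-description pairs. The minimal Python sketch below only illustrates the shape of such a pair under stated assumptions: the AU lookup table (FACS names with emotions loosely following EMFACS-style prototypes), the instruction wording, the 0 to 5 intensity scale, and the intensity-weighted vote used for the interpretation are all hypothetical placeholders, not the paper's method.

```python
import json

# Hypothetical AU lookup: FACS names plus emotions commonly linked to each AU
# (loosely following EMFACS-style prototypes). Not the paper's table.
AU_TABLE = {
    1: ("Inner Brow Raiser", ["sadness", "surprise", "fear"]),
    4: ("Brow Lowerer", ["anger", "sadness", "fear"]),
    6: ("Cheek Raiser", ["happiness"]),
    12: ("Lip Corner Puller", ["happiness"]),
    15: ("Lip Corner Depressor", ["sadness"]),
}

# Hypothetical instruction text; the paper's actual prompt is not given here.
INSTRUCTION = "Analyze the facial expression in the image step by step and give the expression label."


def build_pair(au_intensities: dict[int, float], label: str) -> dict[str, str]:
    """Turn AU detections (id -> intensity on an assumed 0-5 scale) and a label into one pair."""
    scores: dict[str, float] = {}
    observations = []
    for au_id, intensity in sorted(au_intensities.items()):
        name, emotions = AU_TABLE[au_id]
        observations.append(
            f"- AU{au_id} ({name}), intensity {intensity:.1f}: linked to {', '.join(emotions)}"
        )
        for emotion in emotions:
            # Toy heuristic: each AU votes for its emotions, weighted by intensity.
            scores[emotion] = scores.get(emotion, 0.0) + intensity
    # Rank emotions by total vote (ties broken alphabetically).
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    interpretation = (
        f"Dominant cue: {ranked[0]}; secondary cues: {', '.join(ranked[1:]) or 'none'}."
    )
    description = "\n".join(
        ["Key observations:", *observations,
         f"Overall emotional interpretation: {interpretation}",
         f"Conclusion: {label}"]
    )
    return {"instruction": INSTRUCTION, "description": description}


# Example: AU1 + AU4 + AU15 is a typical sadness configuration.
pair = build_pair({1: 2.0, 4: 3.5, 15: 4.0}, label="sadness")
print(json.dumps(pair, indent=2))
```

Here the conclusion is taken directly from the dataset label, while the vote tally merely stands in for the interpretation step; a real engine would need a richer way of describing AU interactions than summed intensities.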