Abstract: Across billions of faces shaped by thousands of different cultures and ethnicities, one thing remains universal: the way emotions are expressed. To take the next step in human-machine interaction, a machine (e.g., a humanoid robot) must be able to interpret facial emotions. Recognizing micro-expressions gives a machine deeper insight into a person's true feelings, so that it can take human emotion into account when making decisions. For instance, such machines will be able to detect dangerous situations, alert caregivers to challenges, and provide appropriate responses. Micro-expressions are involuntary and transient facial expressions capable of revealing genuine emotions. We propose a new hybrid neural network (NN) model capable of micro-expression recognition in real-time applications. Several NN models are first compared in this study. Then, a hybrid NN model is created by combining a convolutional neural network (CNN), a recurrent neural network (RNN, e.g., long short-term memory (LSTM)), and a vision transformer. The CNN extracts spatial features (within a neighborhood of an image), whereas the LSTM summarizes temporal features. In addition, a transformer with an attention mechanism can capture sparse spatial relations within an image or between frames in a video clip. The inputs of the model are short facial videos, and the outputs are the micro-expressions recognized from those videos. The NN models are trained and tested on publicly available facial micro-expression datasets to recognize different micro-expressions (e.g., happiness, fear, anger, surprise, disgust, sadness). Score fusion and improvement metrics are also presented in our experiments. The results of our proposed models are compared with those of literature-reported methods tested on the same datasets. The proposed hybrid model performs the best, and score fusion can dramatically increase recognition performance.
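To make the idea of score fusion concrete, the following is a minimal sketch of how per-class scores from several models might be combined for one video clip. The abstract does not state which fusion rule is used, so the weighted average of softmax scores shown here is an assumption, as are the function names (`softmax`, `fuse_scores`, `predict`) and the example logits; only the six emotion labels come from the abstract.

```python
import math

# Emotion classes listed in the abstract.
LABELS = ["happiness", "fear", "anger", "surprise", "disgust", "sadness"]


def softmax(logits):
    """Convert raw model outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(per_model_scores, weights=None):
    """Weighted average of per-class scores (an assumed fusion rule)."""
    n_models = len(per_model_scores)
    if weights is None:
        weights = [1.0] * n_models  # equal weighting by default
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    weight_sum = sum(weights)
    n_classes = len(per_model_scores[0])
    fused = [0.0] * n_classes
    for scores, w in zip(per_model_scores, weights):
        if len(scores) != n_classes:
            raise ValueError("all models must score the same classes")
        for i, s in enumerate(scores):
            fused[i] += (w / weight_sum) * s
    return fused


def predict(per_model_logits, weights=None):
    """Return the fused label and the fused score vector for one clip."""
    scores = [softmax(logits) for logits in per_model_logits]
    fused = fuse_scores(scores, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best], fused


# Hypothetical logits from three models for a single clip.
example_logits = [
    [0.0, 0.0, 0.0, 2.0, 0.0, 0.0],  # favors "surprise"
    [0.0, 1.0, 0.0, 0.5, 0.0, 0.0],  # favors "fear"
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # favors "surprise"
]
label, fused = predict(example_logits)
print(label, [round(s, 3) for s in fused])
```

In this sketch the single model that disagrees is outvoted once scores are averaged, which illustrates one way fusion could stabilize predictions; other rules (for example, product or maximum of scores) would be equally plausible readings of the abstract.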

CNN-Transformer Architecture Solution for Compound Facial Expression Recognition

Attention on Emotions: A Vision Transformer Approach to Advancing Facial Expression Recognition

Compound facial expressions recognition approach using DCGAN and CNN

Facial Expression Recognition Based on Multi-Scale Convolutional Vision Transformer

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Facial Expression Recognition Based on Fine-Tuned Channel–Spatial Attention Transformer

Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion

ViTCN: Hybrid Vision Transformer with Temporal Convolution for Multi-Emotion Recognition

A Robust Lightweight Compound Emotion Recognition Approach Using Depthwise Separable CNN

TriCAFFNet: A Tri-Cross-Attention Transformer with a Multi-Feature Fusion Network for Facial Expression Recognition

Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition

Robust facial expression recognition with Transformer Block Enhancement Module

Facial Expression Recognition Model Based on CNN and Data Augmentation Method

Facial Micro-Expression Recognition Enhanced by Score Fusion and a Hybrid Model from Convolutional LSTM and Vision Transformer

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

CIT-EmotionNet: CNN Interactive Transformer Network for EEG Emotion Recognition

A Unified Transformer-based Network for multimodal Emotion Recognition

A Joint Local Spatial and Global Temporal CNN-Transformer for Dynamic Facial Expression Recognition

Emotion Recognition Using Transformers with Masked Learning

A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition