Abstract:

Objective Micro-expressions are brief, subtle facial muscle movements that involuntarily reveal emotions when a person tries to hide their true feelings. They reflect a person's true feelings and motivations more faithfully than macro-expressions. Micro-expression recognition aims to automatically identify the emotional category of a subject from the movement of the facial muscles, which has important application value in lie detection, psychological diagnosis, and other areas. In the early development of micro-expression recognition, local binary patterns and optical flow were widely used as features for training traditional machine learning models. However, handcrafted features rely on manually designed rules, which makes it difficult to adapt to differences in micro-expression data across individuals and scenarios. Because deep learning can automatically learn the optimal feature representation of an image, deep learning-based methods far exceed traditional methods in recognition performance. Nevertheless, micro-expressions appear as subtle facial changes, so the recognition task remains challenging. By analyzing the pixel movement between consecutive frames, optical flow can represent the dynamic information of micro-expressions. Deep learning-based micro-expression recognition methods use optical flow to describe facial muscle motion and thereby improve recognition performance. However, existing methods usually extract the optical flow offline, which relies on existing optical flow estimation techniques, describes subtle expressions insufficiently, and neglects static facial expression information, all of which restrict the recognition performance of the model. Therefore, this study proposes a micro-expression recognition network based on adaptive optical flow estimation, which performs optical flow estimation and micro-expression classification in parallel so that micro-expression-related motion features are learned adaptively.

Method Training samples of micro-expressions are limited, which makes it difficult to train complex network models. Therefore, in the preprocessing stage, this study selects the apex frames and their neighboring frames in the micro-expression video sequence as training data. In addition, when the data are loaded, the original training data are replaced, with a certain probability, by image pairs from the video sequence that contain motion information. Second, a deep learning network with a dense differential encoder-decoder performs adaptive optical flow estimation of facial muscle motion to improve the characterization of subtle expressions. In the dense differential encoder, ResNet18 extracts features from the two frames and from their difference map. The branches that process the two frames share parameters. A motion enhancement module is added to the feature extraction branch of the difference map to carry out inter-layer information interaction. In the motion enhancement module, a spatial attention mechanism is applied to the difference-map features computed from the two frames so that they focus on micro-expression-related motion; the two frames are subtracted from each other to preserve and amplify the difference between them; and the two resulting features together provide useful information for the subsequent layers. The decoder maps the multilevel facial displacement information extracted by the dense differential encoder, together with the last-layer output features of the two frames, to reconstruct the optical flow features. Vision Transformer is a deep learning model based on the self-attention mechanism and, in comparison with traditional convolutional neural networks, has global perception capability. Using the feature extraction capability of the vision Transformer, the micro-expression discriminative information embedded in the reconstructed optical flow is then mined. Finally, the semantic information of micro-expressions extracted from the facial displacement information and the discriminative information extracted by the vision Transformer are fused to provide rich information for micro-expression classification. This study constrains the optical flow estimation task with an endpoint error loss, which continuously reduces the Euclidean distance between the predicted and ground-truth optical flow. Cross-entropy losses constrain the features extracted by the vision Transformer and the fused features, so that the network learns micro-expression-related information. At the same time, the image with low motion intensity in the two frames is treated as equivalent to a neutral expression (without motion information), and a KL-divergence loss is applied to the encoder's output features to suppress irrelevant information. The loss functions work together to optimize the network.

Result This study evaluates model performance on public datasets using the leave-one-subject-out cross-validation strategy. Face alignment and cropping are performed on the samples to unify the data. To demonstrate that the proposed method is state of the art, we compare it with existing mainstream methods on the composite dataset constructed from SMIC, SAMM, and CASME II. Our method achieves UF1 and UAR of 82.89% and 85.59% on the whole dataset, 78.16% and 80.89% on the SMIC part, 94.52% and 96.02% on the CASME II part, and 73.24% and 75.83% on the SAMM part. It achieves optimal results on the whole dataset, the SMIC part, and the CASME II part, and suboptimal results on the SAMM part. Compared with the recently proposed micro-expression method based on feature representation learning with adaptive displacement generation and Transformer fusion (FRL-DGT), our method demonstrates an improvement of 1.77% and 4.85%.

Conclusion The micro-expression recognition model based on adaptive optical flow estimation proposed in this study combines the two tasks of adaptive optical flow estimation and micro-expression classification. On the one hand, it senses subtle facial movements in an end-to-end manner and improves the description of subtle expressions; on the other hand, it fully exploits micro-expression discriminative information and enhances micro-expression recognition performance.
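
The endpoint error mentioned in the Method part is commonly defined as the mean Euclidean distance between predicted and reference flow vectors. The following is a minimal sketch of that standard definition in plain Python, not the authors' implementation; the function name `endpoint_error` and the representation of a flow field as a flat list of `(u, v)` displacement pairs are assumptions made for illustration.

```python
import math


def endpoint_error(pred_flow, ref_flow):
    """Mean endpoint error between two flow fields.

    Each flow field is a flat list of (u, v) displacement vectors,
    one per pixel (a hypothetical layout chosen for this sketch).
    """
    if len(pred_flow) != len(ref_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and the same length")
    total = 0.0
    for (pu, pv), (ru, rv) in zip(pred_flow, ref_flow):
        # Euclidean distance between the predicted and reference vectors
        total += math.hypot(pu - ru, pv - rv)
    return total / len(pred_flow)


# Toy data: per-pixel errors are 0, 2, and 5, so the mean is 7/3.
pred = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
ref = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
epe = endpoint_error(pred, ref)
```

Used as a training loss, this quantity is minimized, which matches the description of continuously reducing the distance between the predicted and reference optical flow.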
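
The two metrics reported in the Result part, UF1 (unweighted F1-score) and UAR (unweighted average recall), average per-class F1 and per-class recall with equal weight for every class, so a rare class counts as much as a frequent one. The sketch below shows the usual computation on toy labels; the function name `uf1_uar`, the three-class setup, and the assumption that predictions from all leave-one-subject-out folds are pooled before scoring are illustrative choices, not details stated in the abstract.

```python
def uf1_uar(y_true, y_pred, num_classes):
    """Return (UF1, UAR) for integer class labels in [0, num_classes)."""
    f1_scores = []
    recalls = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Per-class F1 written directly in terms of TP, FP, FN
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
        # Per-class recall; a class with no samples contributes 0
        support = tp + fn
        recalls.append(tp / support if support else 0.0)
    return sum(f1_scores) / num_classes, sum(recalls) / num_classes


# Toy labels, assumed to be pooled over all held-out subjects.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
uf1, uar = uf1_uar(y_true, y_pred, num_classes=3)
```

On this toy data the per-class F1 values are 0.8, 0.8, and 1.0, and the per-class recalls are 2/3, 1, and 1, giving a UF1 of about 0.867 and a UAR of about 0.889.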
