Abstract: Unlike macro-expressions, micro-expressions are subtle, hard-to-detect emotional expressions that often carry rich information about mental activity. Practical micro-expression recognition is essential in interrogation and healthcare. Neural networks are currently one of the most common approaches to micro-expression recognition. However, neural networks tend to grow in complexity as their accuracy improves, and overly large networks place extremely high demands on the hardware that runs them. In recent years, vision transformers based on self-attention mechanisms have achieved image recognition and classification accuracy no lower than that of convolutional neural networks. Their drawback is that, without the image-specific inductive biases inherent to convolutional neural networks, improving accuracy comes at the cost of an exponential increase in the number of parameters. This paper describes an approach that trains a facial expression feature extractor by transfer learning and then fine-tunes and optimizes the MobileViT model to perform the micro-expression recognition task. First, the CASME II, SAMM, and SMIC datasets are combined into a compound dataset, and macro-expression samples are extracted from three macro-expression datasets. The macro-expression and micro-expression samples are pre-processed identically so that they are similar. Second, the macro-expression samples are used to train the MobileNetV2 block in MobileViT as a facial expression feature extractor, and the weights are saved at the point of highest accuracy. Finally, some of the hyperparameters of the MobileViT model are determined by grid search, and the micro-expression samples are then fed into the model for training. The samples are classified using an SVM classifier. In the experiments, the proposed method achieved an accuracy of 84.27%, and processing an individual sample took only 35.4 ms. Comparative experiments show that the proposed method is comparable to state-of-the-art methods in accuracy while improving recognition efficiency.
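
The grid-search step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which hyperparameters were searched or over what ranges, so the names and values in `PARAM_GRID` are assumptions, and `toy_validation_accuracy` is a hypothetical stand-in for training MobileViT and measuring validation accuracy. Keeping only the best-scoring configuration loosely mirrors the idea of saving the weights when accuracy is highest.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Exhaustively score every combination in param_grid; return the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        # Keep only the configuration with the highest score seen so far.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space (not taken from the paper).
PARAM_GRID = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3],
}


def toy_validation_accuracy(params):
    """Stand-in for a real train-and-validate run (assumption for illustration).

    A real implementation would fine-tune the model with `params` and
    return the accuracy measured on a held-out validation split.
    """
    lr_penalty = {1e-4: 0.10, 1e-3: 0.0, 1e-2: 0.15}[params["learning_rate"]]
    batch_penalty = 0.002 * abs(params["batch_size"] - 32)
    dropout_penalty = 0.5 * abs(params["dropout"] - 0.1)
    return 0.9 - lr_penalty - batch_penalty - dropout_penalty


best_params, best_score = grid_search(PARAM_GRID, toy_validation_accuracy)
print(best_params, round(best_score, 4))
```

Because grid search evaluates every combination, its cost grows multiplicatively with each added hyperparameter (here 3 × 2 × 2 = 12 runs), which is presumably why only some of the hyperparameters are searched this way.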

Lightweight facial landmark detection network based on improved MobileViT

Real-Time Facial Landmark Detection by Attention-driven Lightweight Network

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

TransMarker: A Pure Vision Transformer for Facial Landmark Detection

Lightweight Vision Transformer with Cross Feature Attention

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition

Cascaded Dual Vision Transformer for Accurate Facial Landmark Detection

A lightweight Transformer-based model for fish landmark detection

LW-ViT: The Lightweight Vision Transformer Model Applied in Offline Handwritten Chinese Character Recognition

Lightweight ViT Model for Micro-Expression Recognition Enhanced by Transfer Learning

RepViT: Revisiting Mobile CNN From ViT Perspective

Lantra: Taming Transformers for Robust Facial Landmark Detection

GhostFormer: Efficiently amalgamated CNN-transformer architecture for object detection

CloudViT: A Lightweight Vision Transformer Network for Remote Sensing Cloud Detection

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection

MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition

ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

3-D Facial Landmarks Detection for Intelligent Video Systems

FMViT: A multiple-frequency mixing Vision Transformer

DSCAFormer: Lightweight Vision Transformer With Dual-Branch Spatial Channel Aggregation