Abstract: Face anti-spoofing (FAS) is essential for securing face recognition systems. Although existing methods achieve decent performance, few of them fully leverage temporal information. This inevitably limits performance, because real and fake faces tend to share highly similar spatial appearances while important temporal features between consecutive frames are neglected. In this work, we propose a temporal transformer network (TTN) to learn multi-granularity temporal characteristics for FAS. It mainly consists of temporal difference attentions (TDA), a pyramid temporal aggregation (PTA), and a temporal depth difference loss (TDL). Firstly, the vision transformer (ViT) is used as the backbone, where comprehensive local patches are utilized to capture subtle differences between live and spoof faces. Instead of learning temporal features on global faces, which may miss important local cues, the TDA extracts motion-sensitive cues on each local patch. Moreover, the TDA is inserted into different layers of the ViT to learn multi-scale motion-sensitive local cues and improve FAS performance. Secondly, it is observed that different subjects may show different visual tempos in some actions, making it necessary to model different temporal speeds. Our PTA aggregates temporal features at various tempos, building both short-range and long-range relations among multiple frames. Thirdly, depth maps for real parts may change continuously, while they remain zero for spoof regions. To locate motion features on facial parts, the TDL guides the network to locate spoof facial parts, with motion patterns between neighboring frames set as the ground truth. To the best of our knowledge, this work is the first attempt to learn temporal characteristics via transformers. Both qualitative and quantitative results on several challenging tasks demonstrate the usefulness and effectiveness of our proposed methods.
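
The abstract does not give implementation details, so the following is only a toy sketch of the general idea behind frame differencing at several tempos, not the TDA or PTA modules themselves. It assumes each frame has already been reduced to a plain feature vector; the function names (`temporal_differences`, `mean_pool`, `pyramid_aggregate`) and the strides `(1, 2, 4)` are hypothetical choices made for illustration.

```python
from typing import List, Sequence

Frame = List[float]  # hypothetical layout: one feature vector per frame


def temporal_differences(frames: Sequence[Frame], stride: int = 1) -> List[Frame]:
    """Element-wise difference between frames that are `stride` steps apart."""
    return [
        [b - a for a, b in zip(frames[t], frames[t + stride])]
        for t in range(len(frames) - stride)
    ]


def mean_pool(vectors: Sequence[Frame]) -> Frame:
    """Average a list of equal-length vectors over the time axis."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pyramid_aggregate(frames: Sequence[Frame], strides=(1, 2, 4)) -> Frame:
    """Concatenate time-averaged absolute differences taken at several tempos.

    Small strides relate neighboring frames (short range); large strides
    relate distant frames (long range). This is an assumed simplification,
    not the aggregation used in the paper.
    """
    features: Frame = []
    for stride in strides:
        if stride >= len(frames):
            continue  # not enough frames for this tempo
        diffs = temporal_differences(frames, stride)
        features.extend(mean_pool([[abs(x) for x in d] for d in diffs]))
    return features


# Toy inputs: a sequence with no change over time, and one with steady change.
static_clip = [[1.0, 2.0] for _ in range(5)]
moving_clip = [[float(t), 0.0] for t in range(5)]

print(pyramid_aggregate(static_clip))
print(pyramid_aggregate(moving_clip))
```

In this toy setting an unchanging sequence produces all-zero features at every stride, while a changing one produces larger responses at larger strides. A learned model would replace the fixed differences and averaging with attention and trainable aggregation.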

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing

FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing

G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

Learning Multi-Granularity Temporal Characteristics for Face Anti-Spoofing

Self-Attention and MLP Auxiliary Convolution for Face Anti-Spoofing

Robust face anti-spoofing framework with Convolutional Vision Transformer

MA-ViT: Modality-Agnostic Vision Transformers for Face Anti-Spoofing

MF$^2$ShrT: Multi-Modal Feature Fusion using Shared Layered Transformer for Face Anti-Spoofing

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Enhancing General Face Forgery Detection via Vision Transformer with Low-Rank Adaptation

Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-Spoofing

Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Vision Transformer for Action Units Detection

S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention