Abstract: Face anti-spoofing (FAS) is essential for securing face recognition systems. Although existing methods achieve decent performance, few fully exploit temporal information. This limits performance, because real and fake faces tend to share highly similar spatial appearances while important temporal features between consecutive frames are neglected. In this work, we propose a temporal transformer network (TTN) to learn multi-granularity temporal characteristics for FAS. It consists mainly of temporal difference attentions (TDA), a pyramid temporal aggregation (PTA), and a temporal depth difference loss (TDL). First, a vision transformer (ViT) is used as the backbone, whose local patches expose subtle differences between live and spoof faces. Instead of learning temporal features on global faces, which may miss important local cues, the TDA extracts motion-sensitive cues on each local patch. The TDA is also inserted into different layers of the ViT, so that it learns multi-scale motion-sensitive local cues that improve FAS performance. Second, different subjects may perform some actions at different visual tempos, which makes it necessary to model different temporal speeds. The PTA aggregates temporal features at various tempos, building both short-range and long-range relations among multiple frames. Third, depth maps of real facial parts may change continuously, whereas they remain zero for spoof regions. To localize motion features on facial parts, the TDL guides the network to locate spoof facial parts, with motion patterns between neighboring frames serving as the ground truth. To the best of our knowledge, this work is the first attempt to learn temporal characteristics via transformers. Both qualitative and quantitative results on several challenging tasks demonstrate the usefulness and effectiveness of the proposed methods.
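
To make the idea of patch-level motion cues at several tempos concrete, the following is a minimal illustrative sketch in plain Python. It is not the TDA or PTA described in the abstract, which are learned attention and aggregation modules inside a ViT. It only shows the underlying intuition: differencing per-patch features between frames at several temporal strides. All names (`temporal_difference`, `motion_energy`, `multi_tempo_energy`) and the choice of strides are assumptions made for this example.

```python
from typing import Dict, List, Sequence

# One frame is a list of patch feature vectors (hypothetical toy representation).
Frame = List[List[float]]


def temporal_difference(frames: Sequence[Frame], stride: int = 1) -> List[Frame]:
    """Per-patch feature differences between frames `stride` steps apart."""
    diffs = []
    for t in range(len(frames) - stride):
        cur, nxt = frames[t], frames[t + stride]
        diffs.append(
            [[b - a for a, b in zip(p_cur, p_nxt)] for p_cur, p_nxt in zip(cur, nxt)]
        )
    return diffs


def motion_energy(diffs: Sequence[Frame]) -> List[float]:
    """Mean absolute feature difference per patch, averaged over time."""
    if not diffs:
        return []
    energy = [0.0] * len(diffs[0])
    for frame_diff in diffs:
        for i, patch in enumerate(frame_diff):
            energy[i] += sum(abs(v) for v in patch) / len(patch)
    return [e / len(diffs) for e in energy]


def multi_tempo_energy(
    frames: Sequence[Frame], strides: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[float]]:
    """Motion energy per patch at several temporal strides (short to long range)."""
    return {
        s: motion_energy(temporal_difference(frames, s))
        for s in strides
        if s < len(frames)
    }
```

In this toy setting, a patch whose features never change gets zero energy at every stride, while a patch that changes steadily gets larger energy at larger strides. That is the kind of short-range versus long-range contrast a multi-tempo aggregation can draw on.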

SA$^3$WT: Adaptive Wavelet-Based Transformer with Self-Paced Auto Augmentation for Face Forgery Detection

AGIL-SwinT: Attention-guided Inconsistency Learning for Face Forgery Detection

Unified Video and Image Representation for Boosted Video Face Forgery Detection

F2Trans: High-Frequency Fine-Grained Transformer for Face Forgery Detection

Face Forgery Detection with Long-Range Noise Features and Multilevel Frequency-Aware Clues

Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

Enhancing General Face Forgery Detection via Vision Transformer with Low-Rank Adaptation

Learning Multi-Granularity Temporal Characteristics for Face Anti-Spoofing

WATCHER: Wavelet-Guided Texture-Content Hierarchical Relation Learning for Deepfake Detection

MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection

UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization

Face Forgery Detection by Progressively Enhancing Spatial and Frequency-Aware Features

Spatial-temporal Transformer Network for Protecting Person-of-interest from Deepfaking

Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection

ADT: Anti-Deepfake Transformer

Adaptive Texture and Spectrum Clue Mining for Generalizable Face Forgery Detection

Wavelet-enhanced Weakly Supervised Local Feature Learning for Face Forgery Detection

Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture

UIA-ViT: Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection