G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li

2024-08-15

Abstract: In videos containing spoofed faces, we may uncover the spoofing evidence based on photometric abnormality, dynamic abnormality, or a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to reach incorrect judgments, especially in cases where a spoof is easily distinguishable in terms of dynamics but challenging to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, based on the motivation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of distinguishing live faces from spoofed faces in videos, with particular attention to dynamic cues. Most existing face anti-spoofing (FAS) methods focus on the single-frame scenario and separate real from fake faces using photometric features alone. These purely photometric-driven methods overlook cues that only emerge over time, so an FAS system can misjudge a sample whose dynamics clearly reveal the spoof but whose photometric appearance is hard to tell apart from a live face. To overcome this limitation, the paper proposes the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines face images with facial landmarks (key points) to fuse photometric and dynamic features. The model factorizes attention into a spatial part and a temporal part and fuses them in a spatiotemporal block. For the temporal part, the paper designs a new mechanism, Kronecker temporal attention (KTA), which has a wider receptive field and is therefore better suited to capturing dynamic information. In addition, the paper uses the low-semantic motion of facial landmarks to guide the high-semantic changes in facial expressions, on the grounds that regions containing landmarks may reveal more dynamic cues. In short, G$^2$V$^2$former aims to improve face anti-spoofing by combining information along the spatial and temporal dimensions. Experiments on nine benchmark datasets show that the method achieves superior performance in various scenarios.
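The idea of factorizing attention into space and time can be illustrated with a minimal NumPy sketch. It is a generic divided space-time attention (spatial attention within each frame, then temporal attention across frames at each patch position), not the paper's Kronecker temporal attention, whose exact form is not given in this summary. The function names, the single-head setup, and the identity query/key/value projections are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head self-attention over the second-to-last axis.

    tokens: array of shape (..., N, D).
    Assumption: identity Q/K/V projections (no learned weights).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ tokens                    # (..., N, D)


def factorized_attention(video_tokens):
    """Generic divided space-time attention (hypothetical helper).

    video_tokens: array of shape (T, P, D) with T frames,
    P patches per frame, and D channels per patch token.
    """
    # Spatial step: each patch attends to the P patches of its own frame.
    spatial = self_attention(video_tokens)        # (T, P, D)
    # Temporal step: each patch position attends across the T frames.
    per_patch = np.swapaxes(spatial, 0, 1)        # (P, T, D)
    temporal = self_attention(per_patch)          # (P, T, D)
    return np.swapaxes(temporal, 0, 1)            # (T, P, D)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 9, 16))          # 4 frames, 9 patches, 16 channels
out = factorized_attention(tokens)
print(out.shape)
```

With T frames and P patches per frame, joint attention over all tokens needs (T·P)² score entries, whereas this factorized form needs T·P² + P·T². The temporal step here only links the same patch position across frames, which is the narrow receptive field that a wider temporal attention such as the paper's KTA is meant to improve on.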

Unified Video and Image Representation for Boosted Video Face Forgery Detection

Learning Multi-Granularity Temporal Characteristics for Face Anti-Spoofing

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

GGViT:Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection

Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Deep Spatial Gradient and Temporal Depth Learning for Face Anti-Spoofing

Face Anti-Spoofing by the Enhancement of Temporal Motion

Benchmarking Joint Face Spoofing and Forgery Detection with Visual and Physiological Cues

Dynamic Convolutional Network for Generalizable Face Anti-spoofing

Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection

Constructing Spatio-Temporal Graphs for Face Forgery Detection

Enhance the Motion Cues for Face Anti-Spoofing using CNN-LSTM Architecture

Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer

Multi-Frames Temporal Abnormal Clues Learning Method for Face Anti-Spoofing

ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection

Static and Dynamic Fusion for Multi-modal Cross-ethnicity Face Anti-spoofing

FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing

Two-stream Convolutional Networks for Multi-frame Face Anti-spoofing