G$^2$V$^2$former: Graph Guided Video Vision Transformer for Face Anti-Spoofing

Jingyi Yang, Zitong Yu, Xiuming Ni, Jia He, Hui Li
2024-08-15
Abstract: In videos containing spoofed faces, we may uncover the spoofing evidence based on photometric abnormality, dynamic abnormality, or a combination of both. Prevailing face anti-spoofing (FAS) approaches generally concentrate on the single-frame scenario; however, purely photometric-driven methods overlook the dynamic spoofing clues that may be exposed over time. This may lead FAS systems to reach incorrect judgments, especially in cases where a spoof is easily distinguishable in terms of dynamics but challenging to discern in terms of photometrics. To this end, we propose the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines faces with facial landmarks for photometric and dynamic feature fusion. We factorize the attention into space and time, and fuse them via a spatiotemporal block. Specifically, we design a novel temporal attention called Kronecker temporal attention, which has a wider receptive field and is beneficial for capturing dynamic information. Moreover, we leverage the low-semantic motion of facial landmarks to guide the high-semantic change of facial expressions, based on the motivation that regions containing landmarks may reveal more dynamic clues. Extensive experiments on nine benchmark datasets demonstrate that our method achieves superior performance under various scenarios. The code will be released soon.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the task of distinguishing live faces from spoofed faces in videos, with an emphasis on dynamic cues. Most existing face anti-spoofing (FAS) methods focus on the single-frame scenario and separate real from spoofed faces using photometric features alone. These purely photometric-driven methods overlook cues that only emerge over time, so a FAS system can misjudge samples whose dynamics are clearly abnormal but whose photometric appearance is hard to tell apart from that of a live face. To overcome this limitation, the paper proposes the Graph Guided Video Vision Transformer (G$^2$V$^2$former), which combines face images with facial landmarks to fuse photometric and dynamic features. Specifically, the paper designs a new temporal attention mechanism, Kronecker Temporal Attention (KTA), which has a wider receptive field and helps capture dynamic information. The paper also uses the low-semantic motion of facial landmarks to guide the high-semantic changes in facial expression, on the grounds that regions containing landmarks may reveal more dynamic cues. In short, G$^2$V$^2$former aims to improve face anti-spoofing performance by combining information from the spatial and temporal dimensions. In experiments on nine benchmark datasets, the method shows superior performance under various scenarios.
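To make the idea of factorizing attention into space and time more concrete, the following is a minimal NumPy sketch of generic factorized space-time self-attention over video tokens. It is not the paper's implementation and does not reproduce Kronecker temporal attention or the landmark-guided graph branch, whose details are not given here. The function names, the `(T, N, D)` token layout, the residual connections, and the use of identity query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: array of shape (..., L, D); attention runs over the L axis.
    Assumption: identity Q/K/V projections, to keep the sketch short.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                    # (..., L, D)


def factorized_space_time_attention(video_tokens):
    """Apply spatial attention, then temporal attention.

    video_tokens: array of shape (T, N, D) with T frames, N patch tokens
    per frame, and D channels (hypothetical layout).
    """
    # Spatial step: within each frame, every patch attends to the
    # N patches of that same frame.
    spatial = video_tokens + self_attention(video_tokens)

    # Temporal step: each patch position attends to the same position
    # in all T frames. This is the plain same-position variant, not the
    # wider-receptive-field Kronecker temporal attention from the paper.
    per_patch = np.swapaxes(spatial, 0, 1)            # (N, T, D)
    temporal = per_patch + self_attention(per_patch)  # (N, T, D)
    return np.swapaxes(temporal, 0, 1)                # (T, N, D)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 9, 16))  # 4 frames, 9 patches, 16 channels
out = factorized_space_time_attention(x)
```

Factorizing in this way keeps each attention matrix at size N x N or T x T instead of (T·N) x (T·N), which is the usual motivation for splitting space and time. In this plain form, the temporal step only compares a patch with the same patch position in other frames.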