Robust face anti-spoofing framework with Convolutional Vision Transformer

Yunseung Lee, Youngjun Kwak, Jinho Shin
2023-07-24
Abstract: Owing to the advances in image processing technology and large-scale datasets, companies have implemented facial authentication processes, thereby stimulating increased focus on face anti-spoofing (FAS) against realistic presentation attacks. Recently, various attempts have been made to improve face recognition performance using both global and local learning on face images; however, to the best of our knowledge, this is the first study to investigate whether the robustness of FAS against domain shifts is improved by considering global information and local cues in face images captured using self-attention and convolutional layers. This study proposes a convolutional vision transformer-based framework that achieves robust performance for various unseen domain data. Our model resulted in 7.3 and 12.9 percentage-point increases in FAS performance compared to models using only a convolutional neural network or vision transformer, respectively. It also shows the highest average rank in sub-protocols of cross-dataset setting over the other nine benchmark models for domain generalization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the robustness of Face Anti-Spoofing (FAS) models against domain shifts. Specifically, it proposes a framework based on the Convolutional Vision Transformer (ConViT), which captures global information through self-attention and local cues through convolutional layers. The goal is to improve the model's generalization across datasets, especially on data from unseen domains.

The main contributions of this study are:

1. **Proposing a new framework**: ConViT is used to extract image features, combining the strengths of self-attention and convolution to achieve better generalization.
2. **Improving the label discretization method**: The binary classification problem is recast as a regression problem, with discretized pseudo-labels generated through the CutMix technique. This mitigates the overfitting seen in conventional binary classification methods.
3. **Excellent experimental results**: The proposed framework improves FAS performance by 7.3 and 12.9 percentage points over models that use only a convolutional neural network or only a vision transformer, respectively. It also achieves the highest average rank across the sub-protocols of the cross-dataset setting, ahead of nine other benchmark models for domain generalization.

In summary, the paper develops an FAS model that handles domain shift effectively, improving robustness and generalization by integrating local and global information.
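
To make the label discretization idea concrete, the following is a minimal sketch of how a CutMix-style mix of a live and a spoof image could yield a discretized regression target. It is an illustration under stated assumptions, not the authors' implementation: the function name `cutmix_pseudo_label`, the `num_bins` parameter, the convention that 1.0 means fully live, and the rounding rule are all hypothetical, and images are plain 2D lists so the example has no dependencies.

```python
import random


def cutmix_pseudo_label(live_img, spoof_img, num_bins=10, rng=None):
    """Paste a random rectangular spoof patch onto a live image.

    Returns (mixed_image, label), where label is the fraction of live
    pixels rounded to the nearest multiple of 1 / num_bins.
    Assumption: 1.0 means fully live and 0.0 means fully spoof.
    """
    if rng is None:
        rng = random.Random()
    h, w = len(live_img), len(live_img[0])
    if len(spoof_img) != h or any(len(row) != w for row in spoof_img):
        raise ValueError("live_img and spoof_img must have the same shape")

    # Sample the patch size, then a position that keeps it inside the image.
    ph = rng.randint(0, h)
    pw = rng.randint(0, w)
    top = rng.randint(0, h - ph)
    left = rng.randint(0, w - pw)

    # Copy the live image and overwrite the patch region with spoof pixels.
    mixed = [row[:] for row in live_img]
    for r in range(top, top + ph):
        mixed[r][left:left + pw] = spoof_img[r][left:left + pw]

    # Continuous target: share of pixels that still come from the live image.
    live_ratio = 1.0 - (ph * pw) / (h * w)
    # Discretize to one of num_bins + 1 levels in [0, 1].
    label = round(live_ratio * num_bins) / num_bins
    return mixed, label


if __name__ == "__main__":
    live = [[1.0] * 8 for _ in range(8)]
    spoof = [[0.0] * 8 for _ in range(8)]
    mixed, label = cutmix_pseudo_label(live, spoof, rng=random.Random(0))
    print("pseudo-label:", label)
```

A regressor trained on such targets predicts a liveness score in [0, 1] rather than a hard live/spoof class, which is one way to read the summary's claim that the regression formulation reduces overfitting to binary labels.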