Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-Spoofing

Arman Keresh, Pakizar Shamoi
2024-06-20
Abstract: Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against a traditional CNN model, EfficientNet b2, on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than the CNN model in terms of accuracy and resistance to different spoofing methods. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses liveness detection in computer vision, specifically face anti-spoofing for face recognition systems. Although such systems are widely used, they are vulnerable to spoofing attacks in which an attacker impersonates a legitimate user with a photo, video, or mask. The researchers explored the Vision Transformer (ViT) architecture, fine-tuned with the DINO self-supervised learning framework, which enables the model to learn distinguishing features from unlabeled data. The paper compared the ViT model fine-tuned with DINO against a traditional Convolutional Neural Network (CNN) model, EfficientNet b2, on the face anti-spoofing task. The experimental results showed that the ViT model outperformed the CNN model in accuracy and in resistance to different spoofing methods. The researchers also collected their own dataset from a biometric application to further validate these findings.
The main contributions of the paper are:
1. Introducing the Vision Transformer architecture fine-tuned with the DINO framework for face anti-spoofing.
2. A comparative analysis of the traditional CNN model EfficientNet b2 and the fine-tuned ViT model on face anti-spoofing tasks.
The paper reviewed existing face anti-spoofing methods, including traditional machine learning techniques and deep learning methods, particularly recent applications of the Transformer architecture to anti-spoofing. It found that Transformer models, through their self-attention mechanisms, can better capture global dependencies and identify complex spoofing cues. In the experimental section, the researchers evaluated the model's performance on multiple benchmark datasets and proposed a training algorithm.
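To make the point about self-attention and global dependencies concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a handful of patch embeddings. It is an illustration only, not the paper's model: the function name `self_attention` and the toy `patches` values are made up for this example, and the learned query/key/value projections of a real ViT are replaced by the identity to keep the code short.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of n patch embeddings, each a list of d floats.
    Assumption for brevity: queries, keys, and values are the tokens
    themselves (identity projections). A real ViT learns separate
    projection matrices and uses several heads.
    Returns (outputs, weights), where weights[i][j] is how much
    patch i attends to patch j.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for query in tokens:
        # Score this patch against every patch in the image, near or far.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all patch embeddings.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy input: three 2-dimensional "patch embeddings".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(patches)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Every row of `weights` is a probability distribution over all patches, so each output embedding mixes information from the whole image in a single layer. A convolution, by contrast, only sees a local neighborhood per layer, which is the intuition behind the global-dependency argument above.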
The experimental results showed that the ViT (DINO) model outperformed the EfficientNet b2 model on all evaluation metrics, supporting the advantage of the Transformer architecture for face anti-spoofing. Future work will focus on the model's generalization ability, its computational complexity, the impact of environmental changes, and the integration of other data types and self-supervised learning techniques. Overall, this research emphasizes the value of advanced Transformer architectures and self-supervised learning for strengthening the security of biometric recognition systems.
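The summary above does not say which evaluation metrics were used. As a hedged illustration, the sketch below computes accuracy together with three error rates commonly reported in anti-spoofing work: APCER (attacks accepted as live), BPCER (live faces rejected as attacks), and their mean, ACER. The function name `pad_metrics`, the label convention, and the example data are assumptions for this sketch, and it pools all attack types together rather than reporting the worst case per attack type.

```python
def pad_metrics(labels, predictions):
    """Compute simple anti-spoofing metrics from binary decisions.

    Assumed convention: 1 = bona fide (live face), 0 = attack (spoof).
    All attack types are pooled into a single class here.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    attack_preds = [p for y, p in zip(labels, predictions) if y == 0]
    live_preds = [p for y, p in zip(labels, predictions) if y == 1]
    if not attack_preds or not live_preds:
        raise ValueError("need at least one attack and one bona fide sample")

    # APCER: fraction of attack samples wrongly accepted as live.
    apcer = sum(1 for p in attack_preds if p == 1) / len(attack_preds)
    # BPCER: fraction of live samples wrongly rejected as attacks.
    bpcer = sum(1 for p in live_preds if p == 0) / len(live_preds)
    # ACER: mean of the two error rates.
    acer = (apcer + bpcer) / 2
    accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
    return {"accuracy": accuracy, "apcer": apcer, "bpcer": bpcer, "acer": acer}


# Made-up decisions for four live and four spoofed samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]
print(pad_metrics(labels, predictions))
# accuracy 0.625, apcer 0.5, bpcer 0.25, acer 0.375
```

Reporting APCER and BPCER separately matters for a security system because accuracy alone can hide an imbalance between accepted attacks and rejected genuine users.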