Adaptive-avg-pooling based Attention Vision Transformer for Face Anti-spoofing

Jichen Yang, Fangfan Chen, Rohan Kumar Das, Zhengyu Zhu, Shunsi Zhang
2024-01-10
Abstract: The traditional vision transformer consists of two parts: a transformer encoder and a multi-layer perceptron (MLP). The former performs feature learning to obtain a better representation, while the latter performs classification. Here, the MLP consists of two fully connected (FC) layers, an average value computing module, another FC layer, and a softmax layer. However, the average value computing module may discard useful information, which we aim to preserve with an alternative framework. In this work, we propose a novel vision transformer, referred to as the adaptive-avg-pooling based attention vision transformer (AAViT), that replaces the average value computing module with adaptive average pooling and attention modules. We evaluate the proposed AAViT on face anti-spoofing using the Replay-Attack database. The experiments show that AAViT outperforms the vision transformer in face anti-spoofing, yielding a lower equal error rate. In addition, we find that the proposed AAViT performs much better than some commonly used neural networks, such as ResNet, and some other known systems on the Replay-Attack corpus.
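To make the contrast concrete, the following is a minimal pure-Python sketch of the general idea: plain averaging over tokens versus adaptive average pooling followed by attention-weighted pooling. The abstract does not specify how AAViT wires these modules together, so everything here is an assumption for illustration: the function names, the bin rule (the common floor/ceil adaptive pooling rule), and the single scoring vector `score_weights` are hypothetical and are not the authors' implementation.

```python
import math


def mean_pool(tokens):
    """Plain average over all tokens: returns one D-dimensional vector."""
    n = len(tokens)
    d = len(tokens[0])
    return [sum(t[j] for t in tokens) / n for j in range(d)]


def adaptive_avg_pool(tokens, out_len):
    """Average N tokens into out_len bins.

    Bin i covers tokens [floor(i*N/out_len), ceil((i+1)*N/out_len)),
    which is an assumed (commonly used) adaptive pooling rule.
    """
    n = len(tokens)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -((-(i + 1) * n) // out_len)  # integer ceil of (i+1)*n/out_len
        pooled.append(mean_pool(tokens[start:end]))
    return pooled


def attention_pool(vectors, score_weights):
    """Softmax-weighted sum of vectors.

    score_weights is a hypothetical scoring vector; in a real model it
    would be learned rather than passed in by hand.
    """
    scores = [sum(w * x for w, x in zip(score_weights, v)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    d = len(vectors[0])
    pooled = [sum(a * v[j] for a, v in zip(alphas, vectors)) for j in range(d)]
    return pooled, alphas


# Four toy tokens with two features each.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
bins = adaptive_avg_pool(tokens, 2)
summary, alphas = attention_pool(bins, [0.0, 1.0])
```

With all-zero scoring weights the attention weights are uniform and the result reduces to the plain average; non-zero weights let some pooled regions contribute more than others, which is the kind of information a plain average cannot retain.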
Image and Video Processing, Signal Processing
What problem does this paper attempt to address?
This paper addresses information loss in traditional Vision Transformers (ViTs) applied to Face Anti-Spoofing. The traditional ViT's Multi-Layer Perceptron (MLP) includes an average value computing module, which may discard useful information. To address this, the researchers propose a new framework called the Adaptive-avg-pooling based Attention Vision Transformer (AAViT). AAViT replaces the average value computing module with adaptive average pooling and attention modules to preserve more information that is useful for classification. In Face Anti-Spoofing experiments on the Replay-Attack database, AAViT achieves a lower Equal Error Rate (EER) than the traditional ViT and outperforms commonly used neural networks such as ResNet. Through comparative experiments, the authors show the importance of both the adaptive average pooling and the attention mechanism in AAViT, and AAViT also outperforms other existing systems at countering face spoofing attacks. Future work will explore different front-ends and extend AAViT to other applications.
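Since the comparison rests on the Equal Error Rate, a short sketch of how an EER can be estimated from classifier scores may help. This is a generic threshold sweep, not the paper's evaluation code; the function name, the convention that higher scores mean "genuine", and the toy scores are all assumptions for illustration.

```python
def equal_error_rate(genuine_scores, attack_scores):
    """Estimate the EER by sweeping thresholds over the observed scores.

    Assumed convention: a sample is accepted as genuine when score >= threshold.
    FAR = fraction of attack samples accepted.
    FRR = fraction of genuine samples rejected.
    The EER is taken at the threshold where FAR and FRR are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(attack_scores))
    best_gap = None
    eer = None
    for t in thresholds:
        far = sum(s >= t for s in attack_scores) / len(attack_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (made up for illustration, not results from the paper).
genuine = [0.9, 0.8, 0.7, 0.3]
attack = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(genuine, attack)
```

A lower EER means the false acceptance and false rejection rates balance at a smaller error, which is the sense in which AAViT is reported to improve on the baseline.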