Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

Cui, Yawen; Liu, Xin
DOI: https://doi.org/10.1007/s11263-024-02055-1
IF: 13.369
2024-06-06
International Journal of Computer Vision
Abstract: Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no work has yet explored the fundamental properties (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors (such as histograms of oriented gradients) benefits the ViT on the IR modality but not on the RGB or Depth modalities. Second, considering the task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps, ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M\(^2\)A\(^2\)E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M\(^2\)A\(^2\)E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Finally, observing the inefficiency of directly finetuning the whole or partial ViT, we design an adaptive multimodal adapter (AMA), which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters. Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. One highlight is that the proposed method is robust under various missing-modality cases where previous multimodal FAS models suffer serious performance drops. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.
computer science, artificial intelligence
What problem does this paper attempt to address?
The paper asks how to use the Vision Transformer (ViT) and the Masked Autoencoder (MAE) effectively in the multimodal Face Anti-Spoofing (FAS) task. It focuses on three key factors:

1. **Input**: How local feature descriptors affect ViT performance across modalities (RGB, Infrared (IR), and Depth). The authors find that local feature descriptors such as the Histogram of Oriented Gradients improve ViT performance on the IR modality but have no significant effect on the RGB or Depth modalities.
2. **Pre-training**: Whether existing ImageNet pre-trained models suit multimodal FAS. Because of the gaps in task (FAS vs. generic object classification) and modality (multimodal vs. unimodal), ImageNet pre-trained models may not be the best choice. The authors therefore propose the Modality-Asymmetric Masked Autoencoder (M\(^2\)A\(^2\)E) for self-supervised pre-training, which learns more intrinsic task-aware representations.
3. **Fine-tuning**: How to avoid the inefficiency of directly fine-tuning all or part of the ViT. The authors design an Adaptive Multimodal Adapter (AMA) that efficiently aggregates local multimodal features while freezing most of the ViT parameters.

Through these studies, the paper aims to improve the robustness and generalization ability of ViT in multimodal FAS, especially when modalities are missing. The reported experiments show state-of-the-art performance across modality settings and strong results in cross-dataset tests.
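
To give a rough sense of what "freezing most of the ViT parameters" can mean in practice, the sketch below counts parameters for a frozen ViT-Base-like encoder with one small trainable adapter per block. Everything here is an assumption for illustration: the sizes (`dim=768`, `depth=12`, `bottleneck=64`), the function names, and the use of a plain bottleneck adapter (down-projection, nonlinearity, up-projection). It is not the paper's AMA design, which this summary does not specify, and it ignores the patch embedding, position embeddings, and classification head.

```python
# Hypothetical sizes: ViT-Base-like backbone plus a plain bottleneck adapter.
# None of these numbers or names come from the paper.

def vit_block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Weights and biases of one standard ViT encoder block."""
    qkv = dim * 3 * dim + 3 * dim            # fused query/key/value projection
    attn_out = dim * dim + dim               # attention output projection
    hidden = mlp_ratio * dim
    mlp = dim * hidden + hidden + hidden * dim + dim  # two-layer MLP
    norms = 2 * 2 * dim                      # two LayerNorms (scale + shift)
    return qkv + attn_out + mlp + norms


def bottleneck_adapter_params(dim: int, bottleneck: int) -> int:
    """Down-projection and up-projection, each with a bias."""
    return dim * bottleneck + bottleneck + bottleneck * dim + dim


def trainable_fraction(dim: int, depth: int, bottleneck: int) -> float:
    """Share of parameters that train when only the adapters are unfrozen."""
    frozen = depth * vit_block_params(dim)
    trainable = depth * bottleneck_adapter_params(dim, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    frac = trainable_fraction(dim=768, depth=12, bottleneck=64)
    print(f"trainable share: {frac:.2%}")
```

Under these assumed sizes, only a small share of the encoder's parameters (on the order of one to two percent) would receive gradients. The real figure for AMA depends on its actual architecture and on which other layers are left trainable.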