Abstract: Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no work has yet explored the fundamental properties (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors (such as histograms of oriented gradients) benefits the ViT on the IR modality but not on the RGB or Depth modalities. Second, given the task gap (FAS vs. generic object classification) and the modality gap (multimodal vs. unimodal), ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M²A²E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M²A²E learns more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Finally, observing that directly finetuning all or part of the ViT is inefficient, we design an adaptive multimodal adapter (AMA), which efficiently aggregates local multimodal features while freezing the majority of the ViT parameters. Extensive experiments in both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. One highlight is that the proposed method remains robust in various missing-modality cases where previous multimodal FAS models suffer serious performance drops. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.
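As a rough illustration of the kind of local feature descriptor mentioned above, the following minimal sketch computes a simplified histogram of oriented gradients for a single grayscale cell. It is not the descriptor pipeline used in the paper, which the abstract does not specify. The function name `hog_cell_histogram`, the 9-bin unsigned orientation layout, and the omission of block normalization are all assumptions made for illustration.

```python
import math


def hog_cell_histogram(cell, n_bins=9):
    """Unsigned gradient-orientation histogram for one grayscale cell.

    `cell` is a list of equal-length rows of pixel intensities.
    Simplified on purpose (hypothetical helper, not the paper's code):
    central differences on interior pixels, hard bin assignment,
    and no block normalization.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            # Central-difference gradients along x and y.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# A vertical edge: intensity changes only along x.
edge = [[0, 0, 10, 10]] * 4
histogram = hog_cell_histogram(edge)
```

For the vertical-edge cell, all gradient energy falls into the first orientation bin, since the intensity varies only horizontally. A full descriptor would typically concatenate such histograms over many cells and normalize them over blocks before they are combined with the ViT input.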
