Abstract: Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no existing work explores the fundamental properties (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors (such as histograms of oriented gradients) benefits the ViT on the IR modality but not on the RGB or Depth modalities. Second, considering the task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps, ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M²A²E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M²A²E learns more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Finally, observing the inefficiency of directly finetuning the whole or partial ViT, we design an adaptive multimodal adapter (AMA), which efficiently aggregates local multimodal features while freezing the majority of ViT parameters. Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. One highlight is that the proposed method is robust under various missing-modality cases where previous multimodal FAS models suffer serious performance drops. We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.

Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals

Promoting cross-modal representations to improve multimodal foundation models for physiological signals

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

FreqMAE: Frequency-Aware Masked Autoencoder for Multi-Modal IoT Sensing

MultiMAE: Multi-modal Multi-task Masked Autoencoders

Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training

Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection

Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing

MFAE: Masked Frequency Autoencoders for Domain Generalization Face Anti-spoofing

Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing

Neuro-BERT: Rethinking Masked Autoencoding for Self-supervised Neurological Pretraining

CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning

Multimodal Masked Autoencoders Learn Transferable Representations

Neuro2vec: Masked Fourier Spectrum Prediction for Neurophysiological Representation Learning

MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning

Cyclic Autoencoder for Multimodal Data Alignment Using Custom Datasets

Multi-modal Facial Affective Analysis based on Masked Autoencoder

A Cross-Modal Adaptive Masked Autoencoder for Decoding Emotions with Multimodal Data

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations

AbFTNet: An Efficient Transformer Network with Alignment before Fusion for Multimodal Automatic Modulation Recognition

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation