MF²ShrT: Multi-Modal Feature Fusion using Shared Layered Transformer for Face Anti-Spoofing

Aashania Antil, Chhavi Dhiman
DOI: https://doi.org/10.1145/3640817
2024-01-25
Abstract: In recent times, face anti-spoofing (FAS) has gained significant attention in both academic and industrial domains. Although various CNN-based solutions have emerged, multi-modal approaches incorporating RGB, depth, and infrared (IR) data have exhibited better performance than unimodal classifiers. The increasing realism of modern presentation attack instruments creates a persistent need to enhance the performance of such models. Recently, self-attention-based vision transformers (ViT) have become a popular choice in this field, but their fundamental aspects for multi-modal FAS have not yet been thoroughly explored. Therefore, we propose a novel framework for FAS called MF²ShrT, which is based on a pre-trained vision transformer. The proposed framework uses overlapping patches and parameter sharing in the ViT network, allowing it to utilize multiple modalities in a computationally efficient manner. Furthermore, to effectively fuse intermediate features from different encoders of each ViT, we explore a T-encoder-based hybrid feature block that enables the system to identify correlations and dependencies across different modalities. MF²ShrT outperforms conventional vision transformers and achieves state-of-the-art performance on the CASIA-SURF and WMCA benchmarks, demonstrating the efficiency of transformer-based models for presentation attack detection (PAD).
computer science, information systems, theory & methods, software engineering