Abstract: Recent advances in Transformers have substantially improved video understanding. However, the $O(N^2)$ computational complexity of attention poses a significant obstacle given the high dimensionality of video. The problem becomes more pronounced when the frame rate (frames per second, FPS) is raised to better capture motion, since a higher frame rate introduces redundancy and aggravates the computational cost. In this paper, we first show that increasing the FPS rate improves performance. We then present Motion Guided Token Compression (MGTC), a novel approach that lets Transformer models represent a video with a smaller yet more representative set of tokens. This substantially reduces the computational burden and adapts readily to higher FPS rates. Specifically, drawing inspiration from video compression algorithms, we measure the difference between patches in consecutive video frames along the temporal dimension and mask the tokens whose difference falls below a predetermined threshold. This masking strategy removes redundant video content while preserving essential information. Experiments on the widely used video recognition datasets Kinetics-400, UCF101, and HMDB51 show that raising the FPS rate improves top-1 accuracy by more than 1.6, 1.6, and 4.0, respectively. Applying MGTC with a masking ratio of 25\% further improves accuracy by 0.1 and reduces computational cost by over 31\% on Kinetics-400. Even under a fixed computational budget, higher FPS rates paired with MGTC retain their performance gains over lower FPS settings.
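The thresholded masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `motion_guided_keep_mask`, the mean-absolute-difference measure, the patch size, the threshold value, and the choice to always keep the first frame are all assumptions made for illustration.

```python
import numpy as np


def motion_guided_keep_mask(video, patch=16, threshold=0.02):
    """Return a boolean keep-mask of shape (T, H // patch, W // patch).

    video: float array of shape (T, H, W, C), e.g. with values in [0, 1].
    True means the patch token is kept; False means it is masked.
    The difference measure and default values are illustrative assumptions.
    """
    t, h, w, c = video.shape
    gh, gw = h // patch, w // patch

    # Crop to a whole number of patches and expose the patch grid:
    # (T, gh, patch, gw, patch, C).
    grid = video[:, : gh * patch, : gw * patch].reshape(t, gh, patch, gw, patch, c)

    # Mean absolute difference between each patch and the patch at the
    # same grid position in the previous frame: shape (T - 1, gh, gw).
    diff = np.abs(grid[1:] - grid[:-1]).mean(axis=(2, 4, 5))

    # Assumption: the first frame has no predecessor, so keep all its tokens.
    keep = np.ones((t, gh, gw), dtype=bool)
    # Mask tokens whose temporal difference is below the threshold.
    keep[1:] = diff >= threshold
    return keep


if __name__ == "__main__":
    # Toy clip: a static background with one patch changing in frame 2.
    clip = np.zeros((4, 32, 32, 3))
    clip[2, :16, :16] = 0.5
    mask = motion_guided_keep_mask(clip)
    print("kept tokens:", int(mask.sum()), "of", mask.size)
```

Only the tokens marked `True` would be passed to the Transformer, which is how the sequence length $N$, and with it the $O(N^2)$ attention cost, shrinks.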

Motion Guided Token Compression for Efficient Masked Video Modeling
