Abstract: Visual and audio signals are concurrent and complementary modalities in some video actions. Relying on the visual modality alone limits video action recognition performance when actions look alike and differ only in subtle movements, such as tapping guitar and playing guitar. However, fusing audio-visual modalities is challenging because of the heterogeneity gap caused by the inconsistent distribution and representation of multi-modal data. In this paper, we propose a local-to-global multi-modal interaction network (LGMI-Net) that integrates RGB, optical flow and sound information. First, for the local multi-modal interaction, we propose a novel inter-modal channel recalibration (IMCR) block that learns a joint representation from different input modalities by recalibrating the channel information distribution of one modality according to another modality. We also propose a novel RGB modality aggregation (RMA) block that obtains more robust appearance features by mixing in optical flow and sound information. Second, for the global multi-modal interaction, we propose three distinct encoder modes, namely unitary, parallel and triplet encoders, to capture the global multi-modal representation. The unitary encoder has the lowest computational complexity. The parallel encoder uses RGB-guided attention to improve accuracy while remaining lightweight. The triplet encoder aggregates the self-attentions of different modalities and achieves the best recognition performance. We conduct extensive experiments on three public datasets: a UCF101 subset, Kinetics-Sounds and EPIC-Kitchens-55. The results demonstrate the effectiveness of audio-visual complementation. Compared with the state-of-the-art multi-modal methods that use sound (i.e., MM-ViT, AdaMML and G-Blend), the proposed LGMI-Net achieves higher accuracies of 96.05%, 88.37% and 50.3% with 4.49×, 1.40× and 6.02× fewer giga floating point operations (GFLOPs), respectively.
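The abstract does not give the internals of the IMCR block, so the following is only a minimal sketch of the general idea it names: rescaling the channels of one modality with gates computed from another modality. It assumes a squeeze-and-excitation-style design (global average per channel, a single linear layer, a sigmoid gate), which may differ from the authors' block. The names `channel_means`, `recalibrate`, `weights` and `bias` are hypothetical, and the parameter values are toy numbers rather than learned ones.

```python
import math


def channel_means(feature):
    """Squeeze step: average each channel (a flat list of activations)."""
    return [sum(channel) / len(channel) for channel in feature]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(target, guide, weights, bias):
    """Rescale each channel of `target` by a gate computed from `guide`.

    target:  C_t channels of the modality being recalibrated
    guide:   C_g channels of the other modality
    weights: C_t x C_g matrix (hypothetical, stands in for learned parameters)
    bias:    C_t values (hypothetical, stands in for learned parameters)
    """
    summary = channel_means(guide)  # one descriptor per guide channel
    gates = [
        sigmoid(sum(w * s for w, s in zip(row, summary)) + b)
        for row, b in zip(weights, bias)
    ]
    # Every activation in a target channel is scaled by that channel's gate.
    return [[g * v for v in channel] for g, channel in zip(gates, target)]


# Toy example (assumed shapes): 2 RGB channels gated by 3 audio channels.
rgb = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
weights = [[0.0, 0.0, 0.0], [1.0, -1.0, 0.0]]
bias = [0.0, 1.0]

out = recalibrate(rgb, audio, weights, bias)
print(out)
```

In this sketch the gate depends only on the guiding modality, so the same function can be applied in either direction (for example sound gating RGB, or RGB gating optical flow) by swapping the `target` and `guide` arguments.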

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Multimodal fusion for audio-image and video action recognition

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

See, Move and Hear: a Local-to-global Multi-Modal Interaction Network for Video Action Recognition.

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion

CCMA: CapsNet for audio–video sentiment analysis using cross-modal attention

Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

MAViL: Masked Audio-Video Learners

Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities

Collaborative Attention Mechanism for Multi-View Action Recognition

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Self-supervised Contrastive Learning for Audio-Visual Action Recognition

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Integration of audio-visual information for multi-speaker multimedia speaker recognition