Abstract: Visual and audio signals are concurrent and complementary modalities in some video actions. Relying on the visual modality alone limits the performance of video action recognition when actions look alike and differ only in subtle movements, such as tapping guitar and playing guitar. However, fusing audio-visual modalities is challenging because of the heterogeneity gap caused by the inconsistent distribution and representation of multi-modal data. In this paper, we propose a local-to-global multi-modal interaction network (LGMI-Net) that integrates RGB, optical flow and sound information. First, for the local multi-modal interaction, we propose a novel inter-modal channel recalibration (IMCR) block that learns a joint representation from different input modalities by recalibrating the channel information distribution of one modality according to another modality. We also propose a novel RGB modality aggregation (RMA) block that obtains more robust appearance features by mixing in optical flow and sound information. Second, for the global multi-modal interaction, we propose three distinct encoder modes, namely unitary, parallel and triplet encoders, to capture the global multi-modal representation. The unitary encoder has the lowest computational complexity. The parallel encoder uses RGB-guided attention to improve accuracy while remaining lightweight. The triplet encoder aggregates the self-attentions of different modalities and achieves the best recognition performance. We conduct extensive experiments on three public datasets: a UCF101 subset, Kinetics-Sounds and EPIC-Kitchens-55. The results demonstrate the effectiveness of audio-visual complementation. Compared with the state-of-the-art multi-modal methods with sound (i.e. MM-ViT, AdaMML and G-Blend), the proposed LGMI-Net achieves superior accuracies of 96.05%, 88.37% and 50.3% with 4.49×, 1.40× and 6.02× fewer giga floating point operations (GFLOPs), respectively.
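To make the idea of recalibrating one modality's channels according to another more concrete, the following is a minimal NumPy sketch of a squeeze-and-excitation-style cross-modal gate. It is not the IMCR block from the paper, whose exact design the abstract does not give. The function name `cross_modal_channel_gate`, the two-layer bottleneck and the weight shapes `w1` and `w2` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def cross_modal_channel_gate(target, guide, w1, w2):
    """Rescale the channels of `target` using statistics of `guide`.

    Hypothetical illustration only (not the paper's IMCR block).
    target: array of shape (C_t, ...), e.g. RGB features (C_t, T, H, W)
    guide:  array of shape (C_g, ...), e.g. audio features (C_g, F)
    w1:     (C_h, C_g) bottleneck weights (assumed)
    w2:     (C_t, C_h) expansion weights (assumed)
    Returns the recalibrated target and the per-channel gate.
    """
    # Squeeze: average the guide modality over all non-channel axes.
    squeezed = guide.reshape(guide.shape[0], -1).mean(axis=1)  # (C_g,)
    # Excite: small bottleneck MLP with ReLU, then a sigmoid gate.
    hidden = np.maximum(w1 @ squeezed, 0.0)                    # (C_h,)
    gate = sigmoid(w2 @ hidden)                                # (C_t,)
    # Broadcast one gate value per target channel.
    gate_b = gate.reshape((-1,) + (1,) * (target.ndim - 1))
    return target * gate_b, gate


rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4, 5, 5))   # toy RGB feature map
audio = rng.standard_normal((6, 10))      # toy audio feature map
w1 = rng.standard_normal((3, 6))
w2 = rng.standard_normal((8, 3))
recalibrated, gate = cross_modal_channel_gate(rgb, audio, w1, w2)
```

In this sketch each gate value lies strictly between 0 and 1, so the guide modality can only attenuate the target channels. A learned version would train `w1` and `w2` jointly with the rest of the network.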

Multimodal fusion for audio-image and video action recognition

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Human Action Recognition Using Deep Multilevel Multimodal (M2) Fusion of Depth and Inertial Sensors

MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data

Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors

Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition

From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition

Human-centric multimodal fusion network for robust action recognition

Multi-modality Fusion Network for Action Recognition.

Enhancing Human Action Recognition and Violence Detection Through Deep Learning Audiovisual Fusion

Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning

MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion

Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization

Audio-Visual Fusion Based on Interactive Attention for Person Verification

See, Move and Hear: a Local-to-global Multi-Modal Interaction Network for Video Action Recognition.