Abstract: Visual and audio signals are concurrent and complementary modalities in some video actions. Relying on the visual modality alone limits video action recognition performance when actions look alike and differ only in subtle movements, such as tapping guitar and playing guitar. However, fusing audio-visual modalities is challenging because of the heterogeneity gap caused by the inconsistent distributions and representations of multi-modal data. In this paper, we propose a local-to-global multi-modal interaction network (LGMI-Net) that integrates RGB and optical flow with sound information. First, for local multi-modal interaction, we propose a novel inter-modal channel recalibration (IMCR) block that learns a joint representation from different input modalities by recalibrating the channel information distribution of one modality according to another modality. We also propose a novel RGB modality aggregation (RMA) block that obtains more robust appearance features by mixing optical flow and sound information. Second, for global multi-modal interaction, we propose three distinct encoder modes (unitary, parallel and triplet encoders) to capture the global multi-modal representation. The unitary encoder has the lowest computational complexity. The parallel encoder uses RGB-guided attention to improve accuracy while remaining lightweight. The triplet encoder aggregates the self-attentions of different modalities and achieves the best recognition performance. We conduct extensive experiments on three public datasets: a UCF101 subset, Kinetics-Sounds and EPIC-Kitchens-55. The results demonstrate the effectiveness of audio-visual complementation. Compared with state-of-the-art multi-modal methods that use sound (i.e., MM-ViT, AdaMML and G-Blend), the proposed LGMI-Net achieves superior accuracies of 96.05%, 88.37% and 50.3% with 4.49×, 1.40× and 6.02× fewer giga floating point operations (GFLOPs), respectively.
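
The idea of recalibrating the channels of one modality according to another can be illustrated with a small sketch. The abstract does not specify the internals of the IMCR block, so the code below assumes a generic squeeze-and-excitation style gate: the guiding modality is globally average-pooled, passed through two small linear layers, and the resulting per-channel weights rescale the target modality. The function name `cross_modal_channel_gate`, the weight shapes, and the random inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def cross_modal_channel_gate(target, guide, w1, w2):
    """Rescale the channels of `target` with gates computed from `guide`.

    Assumed (not from the paper) SE-style design:
      target: (C_t, H, W) feature map of the modality being recalibrated
      guide:  (C_g, H, W) feature map of the guiding modality
      w1:     (hidden, C_g) weights of the bottleneck layer
      w2:     (C_t, hidden) weights of the expansion layer
    Returns the recalibrated target and the per-channel gates.
    """
    squeezed = guide.mean(axis=(1, 2))        # global average pool -> (C_g,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # bottleneck + ReLU -> (hidden,)
    gates = sigmoid(w2 @ hidden)              # per-channel weights in (0, 1)
    return target * gates[:, None, None], gates


# Toy inputs: 8 RGB channels recalibrated by a 6-channel audio feature map.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 4, 4))
audio = rng.standard_normal((6, 4, 4))
w1 = rng.standard_normal((3, 6)) * 0.1
w2 = rng.standard_normal((8, 3)) * 0.1

recalibrated, gates = cross_modal_channel_gate(rgb, audio, w1, w2)
print(recalibrated.shape, gates.shape)
```

In this sketch the target keeps its spatial layout and only its channel magnitudes change, which is one plausible reading of "recalibrating the channel information distribution"; the actual block may differ in pooling, normalization, or direction of guidance.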

Multi-Modal Multi-Action Video Recognition.

Multi-modality Fusion Network for Action Recognition.

Multi-cue Combination Network for Action-Based Video Classification.

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

A 3D-CNN and LSTM Based Multi-Task Learning Architecture for Action Recognition.

Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval

See, Move and Hear: a Local-to-global Multi-Modal Interaction Network for Video Action Recognition.

Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Multi-level Multi-modal Feature Fusion for Action Recognition in Videos.

Representing Videos As Discriminative Sub-graphs for Action Recognition

Human action recognition via multi-view learning.

Multi-cue based four-stream 3D ResNets for video-based action recognition

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization.

Multi-View Time-Series Hypergraph Neural Network for Action Recognition

MRSN: Multi-Relation Support Network for Video Action Detection

Video-ception Network: Towards Multi-Scale Efficient Asymmetric Spatial-Temporal Interactions

Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition

Video Visual Relation Detection Via Multi-modal Feature Fusion

Online video visual relation detection with hierarchical multi-modal fusion

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition