Abstract: Visual and audio signals are concurrent and complementary modalities in some video actions. Relying on the visual modality alone limits the performance of video action recognition, because some actions look similar and differ only in subtle movements, such as tapping guitar and playing guitar. However, fusing audio-visual modalities is challenging because of the heterogeneity gap caused by the inconsistent distribution and representation of multi-modal data. In this paper, we propose a local-to-global multi-modal interaction network (LGMI-Net) that integrates RGB and optical flow with sound information. First, for the local multi-modal interaction, we propose a novel inter-modal channel recalibration (IMCR) block that learns a joint representation from different input modalities by recalibrating the channel information distribution of one modality according to another modality. We also propose a novel RGB modality aggregation (RMA) block that obtains more robust appearance features by mixing in optical flow and sound information. Second, for the global multi-modal interaction, we propose three distinct encoder modes (unitary, parallel and triplet encoders) to capture the global multi-modal representation. The unitary encoder has the lowest computational complexity. The parallel encoder uses RGB-guided attention to improve accuracy while remaining lightweight. The triplet encoder aggregates the self-attentions of the different modalities and achieves the best recognition performance. We conduct extensive experiments on three public datasets: a UCF101 subset, Kinetics-Sounds and EPIC-Kitchens-55. The results demonstrate the effectiveness of audio-visual complementation. Compared with the state-of-the-art multi-modal methods with sound (i.e. MM-ViT, AdaMML and G-Blend), the proposed LGMI-Net achieves superior accuracies of 96.05%, 88.37% and 50.3% with 4.49×, 1.40× and 6.02× fewer giga floating point operations (GFLOPs), respectively.
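To make the idea of recalibrating one modality's channels according to another more concrete, the following is a minimal sketch and not the authors' IMCR block. It assumes a squeeze-and-excitation-style gate: the guiding modality is average-pooled to one value per channel, a single linear layer maps that descriptor to one sigmoid gate per target channel, and the target channels are rescaled. The names `channel_descriptor` and `recalibrate`, the shapes, and the random weights are all hypothetical; the abstract does not specify the actual layer design.

```python
import math
import random


def channel_descriptor(feat):
    """Global average pooling: one scalar per channel. `feat` is [C][N]."""
    return [sum(ch) / len(ch) for ch in feat]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(target, guide, weights, bias):
    """Rescale each channel of `target` by a gate computed from `guide`.

    `weights` is [C_target][C_guide] and `bias` is [C_target]; both stand in
    for learned parameters (an assumption, not taken from the paper).
    """
    desc = channel_descriptor(guide)
    gates = [
        sigmoid(sum(w * d for w, d in zip(row, desc)) + b)
        for row, b in zip(weights, bias)
    ]
    # Every position in a channel is multiplied by that channel's gate.
    out = [[g * v for v in ch] for g, ch in zip(gates, target)]
    return out, gates


random.seed(0)
# Toy features: 4 RGB channels with 8 positions, 3 audio channels with 6.
rgb = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
audio = [[random.gauss(0, 1) for _ in range(6)] for _ in range(3)]
W = [[random.gauss(0, 0.5) for _ in range(3)] for _ in range(4)]
b = [0.0] * 4

rgb_recal, gates = recalibrate(rgb, audio, W, b)
```

Here the audio features decide how strongly each RGB channel is emphasized or suppressed; swapping the arguments would let RGB guide the audio channels instead. The two modalities may have different channel counts and different numbers of positions, since only the pooled descriptor of the guide is used.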

Multi-cue Combination Network for Action-Based Video Classification.

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Multi-Modal Multi-Action Video Recognition.

Multi-cue based four-stream 3D ResNets for video-based action recognition

Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Video-ception Network: Towards Multi-Scale Efficient Asymmetric Spatial-Temporal Interactions

Multi-modality Fusion Network for Action Recognition.

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification.

Action Recognition Using Co-trained Deep Convolutional Neural Networks.

Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

Multi-Cue Information Fusion For Two-Layer Activity Recognition

Unified Spatio-Temporal Attention Networks for Action Recognition in Videos.

Gaze-Assisted Multi-Stream Deep Neural Network for Action Recognition.

Multipath Attention and Adaptive Gating Network for Video Action Recognition

Multi-scale residual network model combined with Global Average Pooling for action recognition

See, Move and Hear: a Local-to-global Multi-Modal Interaction Network for Video Action Recognition.

Human Action Recognition Combining Sequential Dynamic Images and Two-Stream Convolutional Network