See, Move and Hear: a Local-to-global Multi-Modal Interaction Network for Video Action Recognition.

Feng Fan, Ming Yue, Hu Nannan, Zhou Jiangwan
DOI: https://doi.org/10.1007/s10489-023-04497-5
IF: 5.3
2023-01-01
Applied Intelligence
Abstract: Visual and audio signals are concurrent and complementary modalities in some video actions. Relying on the visual modality alone limits the performance of video action recognition, because some actions look alike and differ only in subtle movements, such as tapping guitar and playing guitar. However, fusing audio-visual modalities is challenging because of the heterogeneity gap caused by the inconsistent distribution and representation of multi-modal data. In this paper, we propose a local-to-global multi-modal interaction network (LGMI-Net) that integrates RGB, optical flow and sound information. First, for the local multi-modal interaction, we propose a novel inter-modal channel recalibration (IMCR) block that learns a joint representation from different input modalities by recalibrating the channel information distribution of one modality according to another modality. We also propose a novel RGB modality aggregation (RMA) block that obtains more robust appearance features by mixing in optical flow and sound information. Second, for the global multi-modal interaction, we propose three distinct encoder modes, the unitary, parallel and triplet encoders, to capture the global multi-modal representation. The unitary encoder has the lowest computational complexity. The parallel encoder uses RGB-guided attention to improve accuracy while remaining lightweight. The triplet encoder aggregates the self-attentions of the different modalities and achieves the best recognition performance. We conduct extensive experiments on three public datasets: a UCF101 subset, Kinetics-Sounds and EPIC-Kitchens-55. The results demonstrate the effectiveness of audio-visual complementation. Compared with the state-of-the-art multi-modal methods with sound (i.e. MM-ViT, AdaMML and G-Blend), the proposed LGMI-Net achieves superior accuracies of 96.05%, 88.37% and 50.3% with 4.49×, 1.40× and 6.02× fewer giga floating-point operations (GFLOPs), respectively.
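The idea of recalibrating the channels of one modality according to another can be illustrated with a small sketch. The abstract does not describe the internals of the IMCR block, so everything below is an assumption made for illustration: the sketch uses a squeeze-and-excitation-style gate, and the function name recalibrate_channels, the projection weights w1 and w2, and the feature-map shapes are hypothetical, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def recalibrate_channels(target, guide, w1, w2):
    """Rescale each channel of `target` with a gate computed from `guide`.

    Assumed (hypothetical) shapes:
      target: (C_t, H, W) features of the modality being recalibrated
      guide:  (C_g, H, W) features of the guiding modality
      w1:     (hidden, C_g) first projection
      w2:     (C_t, hidden) second projection
    """
    # Squeeze: summarize every guide channel by its spatial mean.
    squeezed = guide.mean(axis=(1, 2))            # shape (C_g,)
    # Excite: two small projections with a ReLU in between.
    hidden = np.maximum(w1 @ squeezed, 0.0)       # shape (hidden,)
    gate = sigmoid(w2 @ hidden)                   # shape (C_t,), values in (0, 1)
    # Recalibrate: broadcast one scalar gate over each target channel.
    return target * gate[:, None, None]


# Usage with random stand-in features (not real RGB or audio data).
rng = np.random.default_rng(0)
rgb_features = rng.normal(size=(8, 4, 4))
audio_features = rng.normal(size=(6, 4, 4))
w1 = rng.normal(size=(3, 6))
w2 = rng.normal(size=(8, 3))
recalibrated = recalibrate_channels(rgb_features, audio_features, w1, w2)
```

In this sketch the guiding modality only decides how strongly each channel of the target modality is kept, so the output has the same shape as the target and could be passed to later layers unchanged.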