Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Jinqiao Dou, Xi Chen, Yuehai Wang
DOI: https://doi.org/10.1109/IJCNN54540.2023.10191112
2023-01-01
Abstract: Audio and visual signals usually coexist in realistic scenes, and the human brain learns this multi-modal perception easily. It is therefore important for computers to learn how the human brain handles multi-modal tasks. The audio-visual event localization (AVEL) task involves two sub-tasks: finding the video segments that contain audio-visual events, and determining the category of those events. AVEL remains challenging because of severe background noise and because information from both modalities must be processed simultaneously. Current approaches have two main problems. First, the network tends to be influenced by noise and predicts unreasonable events for consecutive segments within the same video clip. Second, the model oscillates between local and global targets because of multi-objective learning. To address these problems, we propose a decoupling multi-modal fusion network, which not only suppresses the complex noise but also learns local and global information separately. The proposed network consists of two sub-networks: the Which-event sub-network, which predicts the event category, and the Is-event sub-network, which determines the time boundary of the event. We evaluate our method on the standard AVE dataset in both fully and weakly supervised settings, and the results verify its effectiveness.
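The abstract does not say how the outputs of the two sub-networks are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the Which-event branch yields one score per category for the whole video and the Is-event branch yields one event probability per segment; the function name `decode_segments`, the `background` label, the 0.5 threshold, and the example scores are all hypothetical.

```python
from typing import Dict, List

# Hypothetical label for segments judged to contain no audio-visual event.
BACKGROUND = "background"


def decode_segments(
    category_scores: Dict[str, float],
    is_event_probs: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Combine two decoupled predictions into per-segment labels.

    category_scores: assumed Which-event output, one score per category
                     for the whole video.
    is_event_probs:  assumed Is-event output, one probability per segment
                     that the segment contains an event.
    threshold:       assumed cut-off for calling a segment an event.
    """
    # Pick a single category for the video (highest score).
    category = max(category_scores, key=category_scores.get)
    # Gate that category per segment with the event/background decision.
    return [category if p >= threshold else BACKGROUND for p in is_event_probs]


# Made-up scores for a six-segment clip, for illustration only.
labels = decode_segments(
    {"dog_bark": 0.7, "church_bell": 0.2, "violin": 0.1},
    [0.1, 0.2, 0.9, 0.8, 0.95, 0.3],
)
print(labels)
```

Under these assumptions every event segment in a clip receives the same category, which is one way a decoupled design could avoid inconsistent categories across consecutive segments.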