Masked Co-Attention Model for Audio-Visual Event Localization

Hengwei Liu, Xiaodong Gu
DOI: https://doi.org/10.1007/s10489-023-05191-2
IF: 5.3
2024-01-01
Applied Intelligence
Abstract: The objective of Audio-Visual Event Localization (AVEL) is to jointly leverage audio and visual cues to localize the video segments that contain audio-visual events and to classify their categories. The primary focus is on enhancing the semantic consistency between the video and audio segments while mitigating the influence of unrelated segments. However, data from different modalities are encoded in separate spaces, leading to a modality gap. To address this issue, we propose a model based on a masked co-attention (MCA) mechanism to better explore multi-modal correlations. In this approach, both intra-modal and cross-modal attention are employed to determine the correlation between visual and audio segments. Furthermore, we introduce a two-level masking strategy. At the feature level, a random masking method is proposed to alleviate overfitting during training. At the attention level, a mask is applied to the co-attention map to filter out redundant information, thereby obtaining fine-grained multi-modal embeddings. Our proposed MCA framework achieves state-of-the-art results on the publicly available AVE dataset.
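The two masking levels named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how either mask is built, so the sketch assumes element-wise random zeroing at the feature level and a top-k rule on scaled dot-product scores at the attention level. The function names `random_feature_mask` and `masked_co_attention` and the parameters `drop_prob` and `keep_top_k` are hypothetical.

```python
import math
import random


def random_feature_mask(features, drop_prob, rng):
    """Feature-level mask (assumed form): zero each feature entry
    independently with probability drop_prob. Intended for training only."""
    return [[0.0 if rng.random() < drop_prob else x for x in segment]
            for segment in features]


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_co_attention(queries, keys, keep_top_k):
    """Attention-level mask (assumed form): each query segment attends only
    to its keep_top_k highest-scoring key segments from the other modality.
    Ties at the cut-off score are all kept.
    Returns (attended features, attention weights)."""
    dim = len(queries[0])
    attended, all_weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key segment.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        cutoff = sorted(scores, reverse=True)[keep_top_k - 1]
        # Masked positions get -inf so softmax assigns them zero weight.
        masked = [s if s >= cutoff else -math.inf for s in scores]
        weights = softmax(masked)
        attended.append([sum(w * k[j] for w, k in zip(weights, keys))
                         for j in range(len(keys[0]))])
        all_weights.append(weights)
    return attended, all_weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-dimensional features for three one-second segments per modality.
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.2]]
    audio_train = random_feature_mask(audio, drop_prob=0.2, rng=rng)
    audio_ctx, weights = masked_co_attention(audio_train, visual, keep_top_k=2)
    for row in weights:
        print([round(w, 3) for w in row])
```

In this sketch the audio segments act as queries over the visual segments; swapping the arguments gives the visual-to-audio direction, and passing the same sequence as both arguments gives an intra-modal variant.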