Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang
2024-09-12
Abstract: Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. As a result, these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and encourage the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. Specifically, i) LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties, without any extra annotations. This forces the unimodal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to understand the local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address a key issue in Dense Audio-Visual Events Localization (DAVE), which is identifying and locating all events that can be both heard and seen in untrimmed videos. Existing methods typically encode audio and visual representations independently and employ dense cross-modal attention to integrate multimodal information. However, this approach inevitably aggregates irrelevant noise and events, especially in complex and long videos, leading to inaccurate detection.

To tackle this problem, the paper proposes LOCO (Locality-aware Cross-modal Correspondence learning framework), whose core idea is to explore the local temporal continuity of audio-visual events. This free and informative supervisory signal is used to guide the filtering of irrelevant information and to extract complementary multimodal information during both unimodal and cross-modal learning stages. Specifically, LOCO achieves this goal through the following two aspects:

1. **Locality-aware Correspondence Correction (LCC)**: Within a contrastive learning framework, LCC highlights modality-shared semantics by aligning similar audio-visual segment features without additional annotations.
2. **Cross-modal Dynamic Perception (CDP)**: In the cross-modal feature pyramid, CDP imposes local consistency in a data-driven manner to understand the local temporal patterns of audio-visual events.

By combining LCC and CDP, LOCO significantly enhances the performance of the DAVE task and surpasses existing methods. Experimental results show that LOCO achieves notable improvements across multiple evaluation metrics, particularly under stringent thresholds.
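To make the contrastive alignment idea behind LCC more concrete, the following is a minimal sketch of a locality-aware contrastive loss over audio and visual segment features. It is not the authors' implementation: the function names, the fixed `window` that defines temporal neighbours as positives, and the `temperature` value are all assumptions introduced for illustration, and the paper's actual formulation may differ.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _directional_loss(queries, keys, window, temperature):
    """Multi-positive InfoNCE from one modality to the other.

    Assumption: for query segment i, every key segment j with
    |i - j| <= window is treated as a positive (local temporal continuity).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        positives = sum(e for j, e in enumerate(exps) if abs(i - j) <= window)
        total += -math.log(positives / sum(exps))
    return total / len(queries)


def local_contrastive_loss(audio, visual, window=1, temperature=0.1):
    """Symmetric audio-to-visual and visual-to-audio loss over T segments.

    `audio` and `visual` are lists of T feature vectors (lists of floats).
    """
    if len(audio) != len(visual) or not audio:
        raise ValueError("audio and visual must be non-empty and equally long")
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    return 0.5 * (
        _directional_loss(a, v, window, temperature)
        + _directional_loss(v, a, window, temperature)
    )


if __name__ == "__main__":
    # Toy example: four segments with one-hot features.
    segments = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
    aligned = local_contrastive_loss(segments, segments, window=0)
    reversed_order = local_contrastive_loss(segments, segments[::-1], window=0)
    print(aligned < reversed_order)
```

In this toy setup, temporally aligned audio and visual segments yield a lower loss than a time-reversed pairing, which is the behaviour a correspondence-style objective is meant to reward. Widening `window` relaxes the objective by accepting nearby segments as matches.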