Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Ling Xing,Hongyu Qu,Rui Yan,Xiangbo Shu,Jinhui Tang

2024-09-12

Abstract: Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. As a result, these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and encourage the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties without any extra annotations. This forces the unimodal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception layer (CDP) in the cross-modal feature pyramid to understand local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
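To make the contrast between dense and locality-restricted cross-modal attention concrete, here is a minimal NumPy sketch. It is not the paper's CDP layer: the function name `local_cross_attention`, the fixed `window` parameter, and the random features are all assumptions made for illustration, and the abstract describes a data-driven mechanism rather than a fixed window. The sketch only shows how restricting each audio segment to nearby visual segments keeps temporally distant content out of the fused feature.

```python
import numpy as np

def local_cross_attention(audio, visual, window=2):
    """Audio queries attend only to visual keys within +/- `window` segments.

    audio, visual: arrays of shape (T, D), one feature vector per segment.
    Returns the fused features (T, D) and the attention weights (T, T).
    NOTE: hypothetical illustration, not the layer used in the paper.
    """
    T, D = audio.shape
    # Scaled dot-product scores between every audio and visual segment.
    scores = audio @ visual.T / np.sqrt(D)
    # Mask out pairs that are more than `window` segments apart in time.
    idx = np.arange(T)
    outside = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(outside, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so each row is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ visual, weights

# Toy input: 8 temporal segments with 16-dimensional features (made up).
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
visual = rng.normal(size=(8, 16))
fused, weights = local_cross_attention(audio, visual, window=2)
```

With `window` set to `T - 1` the mask is empty and the function reduces to ordinary dense cross-attention, which is the setting the abstract criticizes for long videos.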

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a key issue in Dense Audio-Visual Events Localization (DAVE): identifying and temporally locating all events that can be both heard and seen in untrimmed videos. Existing methods typically encode audio and visual representations independently and employ dense cross-modal attention to integrate multimodal information. However, this approach inevitably aggregates irrelevant noise and events, especially in complex and long videos, leading to inaccurate detection.

To tackle this problem, the paper proposes LOCO (Locality-aware Cross-modal Correspondence learning framework), whose core idea is to exploit the local temporal continuity of audio-visual events. This free and informative supervisory signal is used to guide the filtering of irrelevant information and to extract complementary multimodal information during both the unimodal and cross-modal learning stages. LOCO does this through two components:

1. **Locality-aware Correspondence Correction (LCC)**: Within a contrastive learning framework, LCC highlights modality-shared semantics by aligning similar audio-visual segment features without additional annotations.
2. **Cross-modal Dynamic Perception (CDP)**: In the cross-modal feature pyramid, CDP imposes local consistency in a data-driven manner to understand the local temporal patterns of audio-visual events.

By combining LCC and CDP, LOCO improves performance on the DAVE task and surpasses existing methods. The reported results show improvements across multiple evaluation metrics, particularly under stringent thresholds.
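The idea of aligning audio and visual segment features within a contrastive framework, as described for LCC above, can be sketched as a symmetric cross-entropy over a segment-by-segment similarity matrix. This is a minimal sketch under stated assumptions, not the paper's loss: the function name `local_contrastive_loss`, the soft targets that spread probability over temporal neighbors within `window`, and the `temperature` value are all hypothetical choices made here for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def local_contrastive_loss(audio, visual, window=1, temperature=0.1):
    """Symmetric contrastive loss with temporally local soft targets.

    audio, visual: arrays of shape (T, D), one feature vector per segment.
    Segments within +/- `window` steps are treated as (soft) positives.
    NOTE: hypothetical illustration, not the objective used in the paper.
    """
    # Cosine similarity between every audio and visual segment.
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = a @ v.T / temperature
    # Band-shaped targets: uniform over temporally neighboring segments.
    idx = np.arange(logits.shape[0])
    targets = (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)
    targets /= targets.sum(axis=1, keepdims=True)
    # Audio-to-visual direction normalizes over rows, visual-to-audio over columns.
    loss_a2v = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_v2a = -(targets.T * log_softmax(logits, axis=0)).sum(axis=0).mean()
    return 0.5 * (loss_a2v + loss_v2a)

# Toy check with made-up features: identical streams versus unrelated streams.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
unrelated = rng.normal(size=(8, 16))
aligned_loss = local_contrastive_loss(shared, shared, window=0)
mismatched_loss = local_contrastive_loss(shared, unrelated, window=0)
```

With `window=0` this reduces to a standard symmetric InfoNCE-style loss in which only the same-time segment counts as a positive; a larger `window` is one simple way to encode the assumption that neighboring segments of the same event share semantics.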
