Learning Event-Specific Localization Preferences for Audio-Visual Event Localization

Shiping Ge, Zhiwei Jiang, Yafeng Yin, Cong Wang, Zifeng Cheng, Qing Gu
DOI: https://doi.org/10.1145/3581783.3612506
2023-01-01
Abstract: Audio-Visual Event Localization (AVEL) aims to locate events that are both visible and audible in a video. Existing AVEL methods primarily focus on learning generic localization patterns that are applicable to all events. However, events often exhibit modality biases, such as visual-dominated, audio-dominated, or modality-balanced, which can lead to different localization preferences. These preferences may be overlooked by existing methods, resulting in unsatisfactory localization performance. To address this issue, this paper proposes a novel event-aware localization paradigm, which first identifies the event category and then leverages localization preferences specific to that event for improved event localization. To achieve this, we introduce a memory-assisted metric learning framework, which utilizes historical segments as anchors to adjust the unified representation space for both event classification and event localization. To provide sufficient information for this metric learning, we design a spatial-temporal audio-visual fusion encoder to capture the spatial and temporal interaction between audio and visual modalities. Extensive experiments on the public AVE dataset in both fully-supervised and weakly-supervised settings demonstrate the effectiveness of our approach. Code will be released at https://github.com/ShipingGe/AVEL.