Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing

Xin Sun, Xuan Wang, Qiong Liu, Xi Zhou
DOI: https://doi.org/10.1109/lsp.2024.3388957
2024-04-27
IEEE Signal Processing Letters
Abstract: The weakly-supervised audio-visual video parsing (AVVP) task aims to parse a video into temporal events and predict their modality-specific categories. Current works primarily focus on refining training strategies and follow a framework that fuses signals only at the segment level. However, they overlook that video events, being composed of consecutive segments, require the integration of both local and global contexts to be captured fully. In this letter, we present the Local-Global Fusion Network (LGFNet), designed to facilitate multi-level interaction between audio and visual signals. Specifically, we create a two-dimensional map to generate multi-scale event proposals for both audio and visual modalities. Subsequently, we fuse audio and visual signals at both segment and event levels with a novel boundary-aware feature aggregation method, enabling the simultaneous capture of local and global information. To enhance the temporal alignment between the two modalities, we employ segment-level and event-level contrastive learning. In-depth experiments demonstrate the superiority of our LGFNet.
engineering, electrical & electronic
What problem does this paper attempt to address?