Abstract: Audio-visual event (AVE) localization has attracted much attention in recent years. Most existing methods encode and classify each video segment independently, in isolation from the full video (yielding what can be regarded as segment-level representations of events). As a result, they ignore the semantic consistency of an event within the same full video (which can be regarded as the video-level representation of the event). In contrast to existing methods, we propose a novel video-level semantic consistency guidance network for the AVE localization task. Specifically, we propose an event semantic consistency modeling (ESCM) module that explores video-level semantic information for semantic consistency modeling. It consists of two components: a cross-modal event representation extractor (CERE) and an intra-modal semantic consistency enhancer (ISCE). CERE obtains the event semantic information at the video level. ISCE then takes these video-level event semantics as prior knowledge to guide the model to focus on the semantic continuity of an event within each modality. Moreover, we propose a new negative pair filter loss that encourages the network to filter out irrelevant segment pairs, and a new smooth loss that further increases the gap between different event categories in the weakly-supervised setting. We perform extensive experiments on the public AVE dataset and outperform state-of-the-art methods in both the fully- and weakly-supervised settings, verifying the effectiveness of our method. The code is available at https://github.com/Bravo5542/VSCG.

Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Audio-Visual Event Localization Via Recursive Fusion by Joint Co-Attention

Audio-Visual Event Localization in Unconstrained Videos

Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization

CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization

Dual Attention Matching for Audio-Visual Event Localization

Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Masked Co-Attention Model for Audio-Visual Event Localization

Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization

CSS-Net: A Consistent Segment Selection Network for Audio-visual Event Localization

Semantic and Relation Modulation for Audio-Visual Event Localization

Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization

Learning Event-Specific Localization Preferences for Audio-Visual Event Localization

Multi-Modulation Network for Audio-Visual Event Localization

Dynamic Interactive Learning Network for Audio-Visual Event Localization

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

Audio-Visual Event Localization based on Cross-Modal Interacting Guidance

Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration