Abstract: Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance.

What problem does this paper attempt to address?

This paper addresses class-aware sounding object localization in complex audiovisual scenes: without category labels, the model must both locate sounding objects and identify their classes. Specifically, it targets scenes in which multiple sounding objects and silent objects co-exist (such as the cocktail-party scenario), using only the correspondence between audio and visual information to localize and recognize the sounding objects while filtering out the silent ones.

### Main Challenges

1. **Classification without Additional Semantic Annotations**: How to distinguish and identify different classes of sounding objects without using additional semantic annotations.
2. **Determining Sounding Objects and Filtering Silent Objects**: How to determine, from the mixed sound, which visual objects are producing sound, and how to exclude the silent ones.

### Solutions

To address these challenges, the authors propose a two-stage progressive learning framework:

1. **Coarse-grained Audiovisual Correspondence in Single-source Scenes**:
   - In the single-source case, determine the sounding area through coarse-grained audiovisual correspondence.
   - Extract the visual features of the sounding area as candidate object representations and build a class-representative object dictionary for extracting expressive visual features.
2. **Fine-grained Audiovisual Consistency Supervision in Multi-source Scenes**:
   - In multi-source scenes (such as the cocktail-party scenario), use the established object dictionary to generate class-aware object localization maps, and use audiovisual correspondence to suppress silent areas.
   - Use category-level audiovisual consistency as supervision to align the class distribution of the sounding objects with the class distribution of the mixed sound.
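
The second stage can be illustrated with a small sketch. This is one plausible reading of the description above, not the authors' implementation: the cosine-similarity class maps, the element-wise suppression, the average pooling, and the KL-divergence consistency term are all assumptions, and every function name here is hypothetical. Plain Python lists stand in for feature tensors to keep the example self-contained.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def class_aware_maps(visual_map, dictionary):
    """One H x W map per dictionary entry (assumed: cosine similarity)."""
    return [[[cosine(entry, feat) for feat in row] for row in visual_map]
            for entry in dictionary]

def suppress_silent(class_maps, sounding_map):
    """Assumed suppression: multiply each class map by the sounding-area map."""
    return [[[c * s for c, s in zip(c_row, s_row)]
             for c_row, s_row in zip(cmap, sounding_map)]
            for cmap in class_maps]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def visual_class_distribution(sounding_class_maps):
    """Average-pool each class map, then normalize over the K classes."""
    pooled = []
    for cmap in sounding_class_maps:
        count = sum(len(row) for row in cmap)
        pooled.append(sum(sum(row) for row in cmap) / count)
    return softmax(pooled)

def kl_divergence(p, q, eps=1e-12):
    """Assumed consistency term between audio and visual class distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

In this sketch, a location that matches a dictionary entry but lies outside the sounding area contributes nothing to the visual class distribution, which is the intended effect of filtering out silent objects.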

### Experimental Verification

The experimental results show that the model can effectively localize and recognize sounding objects on both real and synthetic video datasets and can filter out silent objects. The method can also be applied to unsupervised object detection, where it achieves reasonable performance.

### Formula Summary

- **Calculation of Audiovisual Localization Map**:
  \[ l(g(a), f(v))=\sigma\left(\frac{\text{conv}(g(a)^T f(v))}{||g(a)||_2 ||f(v)||_2}\right) \]
  where \( g(a) \) is the global audio descriptor, \( f(v) \) is the visual feature map, \( \sigma \) is the sigmoid activation function, and \(\text{conv}\) is a 1×1 convolution.
- **Localization Loss**:
  \[ L_{\text{loc}} = L_{\text{bce}}(y_{\text{match}}, \text{GMP}(l(g(a_i^s), f(v_j^s)))) \]
  where \( y_{\text{match}} \) indicates whether the audio and the image come from the same pair, \( L_{\text{bce}} \) is the binary cross-entropy loss, and GMP is global max pooling.
- **Weighted Pooling for Extracting Object Representations**:
  \[ o_i=\frac{\sum_{m,n} f(v_i^s)(m,n)\odot l_i(m,n)}{\sum_{m,n} l_i(m,n)} \]
  where \( \odot \) is the Hadamard product.
- **Clustering Loss**:
  \[ L_{\text{clu}}(D, y_i)=\sum_{i = 1}^{N_s}\min_{y_i}||o_i - D^T\cdot y_i||_2^2 \]
  where \( D \) is the object dictionary, \( y_i\in\{0,1\}^K \), and \( \sum y_i = 1 \).
- **Classification Loss**:
  \[ L_{\text{cls}}=L_{\text{ce}}(y_i^*, h_a(a_i^s))+L_{\text{ce}}(y_i^*, h_v(v_i^s)) \]
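
The first-stage formulas above can be traced in a minimal sketch. It is illustrative only and makes several assumptions: the 1×1 convolution on the single-channel similarity map is modeled as a scalar weight `w` with no bias, feature maps are nested Python lists of shape H × W × C, and all function names are hypothetical. A real implementation would use a tensor library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def localization_map(audio_vec, visual_map, w=1.0, eps=1e-8):
    """l(g(a), f(v)): sigmoid of the scaled cosine similarity at each location.

    Assumption: the 1x1 conv is a single scalar weight w (no bias).
    """
    a_norm = norm(audio_vec)
    return [[sigmoid(w * dot(audio_vec, feat) / (a_norm * norm(feat) + eps))
             for feat in row] for row in visual_map]

def global_max_pool(loc_map):
    """GMP: the largest response over all spatial locations."""
    return max(max(row) for row in loc_map)

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one (label, probability) pair."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def weighted_pool(visual_map, loc_map):
    """o_i: localization-weighted average of the visual features."""
    channels = len(visual_map[0][0])
    total = sum(sum(row) for row in loc_map)
    pooled = [0.0] * channels
    for feat_row, l_row in zip(visual_map, loc_map):
        for feat, weight in zip(feat_row, l_row):
            for c in range(channels):
                pooled[c] += feat[c] * weight
    return [p / total for p in pooled]

def assign_cluster(obj, dictionary):
    """Index k minimizing ||o_i - D_k||^2, i.e. the one-hot y_i in L_clu."""
    dists = [sum((o - d) ** 2 for o, d in zip(obj, entry))
             for entry in dictionary]
    return min(range(len(dists)), key=dists.__getitem__)
```

For a matched audio-image pair, the localization loss in this sketch would be `bce(1, global_max_pool(localization_map(a, v)))`; for a mismatched pair the label is 0.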

Class-aware Sounding Objects Localization via Audiovisual Correspondence
