Abstract: Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance.
What problem does this paper attempt to address?
This paper addresses class-aware sounding object localization in complex audio-visual scenes: without any category labels, the model must both locate the sounding objects and identify their classes. Specifically, it focuses on scenes where multiple sounding objects and silent objects coexist (such as the cocktail-party scene), and uses only the correspondence between audio and visual information to locate and identify the sounding objects while filtering out the silent ones.
### Main Challenges
1. **Classification without Additional Semantic Annotations**: How to distinguish and identify different classes of sounding objects when no semantic annotations are available.
2. **Determining Sounding Objects and Filtering Silent Objects**: How to determine, from the mixed sound alone, which visual objects are producing sound, and how to exclude the silent ones.
### Solutions
To address these challenges, the authors propose a two-stage, step-by-step learning framework:
1. **Coarse-grained Audio-visual Correspondence in Single-source Scenes**:
- With a single sound source, determine the sounding area through coarse-grained audio-visual correspondence.
- Use the visual features of the sounding area as candidate object representations and build a category-representation object dictionary from them for extracting expressive visual features.
2. **Fine-grained Audio-visual Consistency Supervision in Multi-source Scenes**:
- In multi-source scenes (such as the cocktail-party scene), refer to this dictionary to generate class-aware object localization maps, and use audio-visual correspondence to suppress silent areas.
- Use category-level audio-visual consistency as supervision to align the distribution of sounding objects with the distribution of the mixed audio at a fine-grained level.
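The second-stage alignment can be illustrated with a small sketch. The summary does not say how the two distributions are compared, so the KL divergence used here is an assumption, as are the helper names `sounding_object_distribution` and `kl_divergence` and the choice of max pooling followed by a softmax.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sounding_object_distribution(class_maps, sounding_map):
    """Assumed pooling: mask each class-aware map with the sounding-area map,
    keep the strongest response per class, then normalize with a softmax."""
    scores = []
    for class_map in class_maps:
        masked = [
            c * s
            for c_row, s_row in zip(class_map, sounding_map)
            for c, s in zip(c_row, s_row)
        ]
        scores.append(max(masked))
    return softmax(scores)


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions (assumed alignment measure)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Two classes on a 1x2 grid; only the left cell is sounding.
class_maps = [[[0.9, 0.1]], [[0.1, 0.9]]]
sounding_map = [[1.0, 0.0]]
visual_dist = sounding_object_distribution(class_maps, sounding_map)
audio_dist = [0.8, 0.2]  # hypothetical output of an audio classifier
print(visual_dist, kl_divergence(visual_dist, audio_dist))
```

In this toy case the silent right-hand cell is masked out, so class 0 dominates the visual distribution; the divergence shrinks as the audio distribution approaches it.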
### Experimental Verification
The experiments show that the model locates and identifies sounding objects, and filters out silent ones, on both realistic and synthesized videos. The learned audio-visual network also transfers to the unsupervised object detection task, where it achieves reasonable performance.
### Formula Summary
- **Calculation of Audio-visual Location Map**:
\[
l(g(a), f(v))=\sigma\left(\frac{\text{conv}(g(a)^T f(v))}{||g(a)||_2 ||f(v)||_2}\right)
\]
where \( g(a) \) is the global audio descriptor, \( f(v) \) is the visual feature map, \( \sigma \) is the sigmoid activation function, and \(\text{conv}\) is a 1×1 convolution.
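As a rough illustration of the location map, the sketch below evaluates the formula at every spatial position of a small feature map. It assumes that the norm of \( f(v) \) is taken per position and that the 1×1 convolution on a single-channel map reduces to a scalar `weight` and `bias`; both names are hypothetical.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def location_map(audio_vec, visual_map, weight=1.0, bias=0.0, eps=1e-8):
    """Sigmoid of the normalized, 1x1-convolved audio-visual dot product.

    audio_vec:  list of C floats, the global audio descriptor g(a)
    visual_map: H x W x C nested lists, the visual feature map f(v)
    """
    a_norm = l2_norm(audio_vec)
    loc = []
    for row in visual_map:
        loc_row = []
        for feat in row:
            dot = sum(a * f for a, f in zip(audio_vec, feat))
            conv = weight * dot + bias  # 1x1 conv on one channel (assumption)
            score = conv / (a_norm * l2_norm(feat) + eps)
            loc_row.append(1.0 / (1.0 + math.exp(-score)))
        loc.append(loc_row)
    return loc


audio = [1.0, 0.0]
visual = [[[2.0, 0.0], [0.0, 3.0]]]  # 1 x 2 grid, 2 channels
loc = location_map(audio, visual)
print(loc)  # the aligned cell scores about 0.73, the orthogonal cell 0.5
```

With the default `weight` and `bias`, the score is the cosine similarity, so the map value depends only on how well the direction of each visual feature matches the audio descriptor.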
- **Location Loss**:
\[
L_{\text{loc}} = L_{\text{bce}}(y_{\text{match}}, \text{GMP}(l(g(a_i^s), f(v_j^s))))
\]
where \( y_{\text{match}} \) indicates whether the audio and the image are from the same pair, \( L_{\text{bce}} \) is the binary cross-entropy loss, and GMP is global maximum pooling.
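A minimal sketch of this loss for one audio-image pair, assuming the location map is already available as a nested list of values in (0, 1). The function names and the clamping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def global_max_pool(loc_map):
    """GMP: the single strongest response anywhere in the map."""
    return max(max(row) for row in loc_map)


def binary_cross_entropy(y, p, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def location_loss(loc_map, y_match):
    """BCE between the match label and the pooled location-map response."""
    return binary_cross_entropy(y_match, global_max_pool(loc_map))


loc_map = [[0.1, 0.9], [0.2, 0.3]]
print(location_loss(loc_map, 1))  # matched pair: small loss
print(location_loss(loc_map, 0))  # mismatched pair: large loss
```

Because only the maximum is pooled, a matched pair is rewarded as soon as one position responds strongly, while a mismatched pair is penalized unless every position stays low.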
- **Weighted Pooling for Extracting Object Representations**:
\[
o_i=\frac{\sum_{m,n} f(v_i^s)(m,n)\odot l_i(m,n)}{\sum_{m,n} l_i(m,n)}
\]
where \( \odot \) is the Hadamard product.
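The weighted pooling amounts to a location-map-weighted average of the visual features. The sketch below assumes an H×W×C feature map and an H×W location map stored as nested lists; the small `eps` in the denominator is an added safeguard, not part of the formula.

```python
def weighted_pool(visual_map, loc_map, eps=1e-8):
    """Average the C-dimensional features, weighting each position by l_i(m, n)."""
    channels = len(visual_map[0][0])
    numerator = [0.0] * channels
    denominator = 0.0
    for feat_row, weight_row in zip(visual_map, loc_map):
        for feat, w in zip(feat_row, weight_row):
            denominator += w
            for c in range(channels):
                numerator[c] += w * feat[c]
    return [x / (denominator + eps) for x in numerator]


visual = [[[1.0, 0.0], [0.0, 1.0]]]       # 1 x 2 grid, 2 channels
print(weighted_pool(visual, [[1.0, 0.0]]))  # dominated by the left cell
print(weighted_pool(visual, [[0.5, 0.5]]))  # equal mix of both cells
```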
- **Clustering Loss**:
\[
L_{\text{clu}}(D, y_i)=\sum_{i = 1}^{N_s}\min_{y_i}||o_i - D^T\cdot y_i||_2^2
\]
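where \( y_i\in\{0,1\}^K \) and \( \sum y_i = 1 \), so each \( y_i \) is a one-hot assignment vector and \( D^T\cdot y_i \) selects a single entry of the dictionary \( D \).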
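Since each assignment vector is one-hot, the inner minimization reduces to picking the nearest dictionary entry, as in the K-means objective. The sketch below evaluates the loss for a fixed dictionary, assuming it is stored as K rows of C floats; updating the dictionary itself is left out.

```python
def clustering_loss(objects, dictionary):
    """Sum of squared distances from each object vector to its nearest entry.

    Returns the loss and the index of the chosen entry for every object,
    which plays the role of the one-hot y_i.
    """
    total = 0.0
    labels = []
    for o in objects:
        dists = [sum((a - b) ** 2 for a, b in zip(o, d)) for d in dictionary]
        k = min(range(len(dists)), key=dists.__getitem__)
        labels.append(k)
        total += dists[k]
    return total, labels


objects = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.0]]
dictionary = [[0.0, 0.0], [1.0, 1.0]]  # K = 2 hypothetical entries
loss, labels = clustering_loss(objects, dictionary)
print(loss, labels)
```

The returned indices can serve as pseudo-labels for the objects.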
- **Classification Loss**:
\[
L_{\text{cls}}=L_{\text{ce}}(y_i^*, h_a(a_i^s))+L_{\text{ce}}(y_i^*, h_v(v_i^s))
\]
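The classification loss can be sketched as the sum of two cross-entropy terms that share one label. The summary does not define the symbols, so this sketch assumes that \( y_i^* \) is a pseudo-label obtained from the clustering step and that \( h_a \) and \( h_v \) are audio and visual classification heads producing logits.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over logits against an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def classification_loss(audio_logits, visual_logits, pseudo_label):
    """Both heads are trained toward the same pseudo-label."""
    return (softmax_cross_entropy(audio_logits, pseudo_label)
            + softmax_cross_entropy(visual_logits, pseudo_label))


# Hypothetical logits for K = 2 categories, pseudo-label 0.
print(classification_loss([2.0, 0.0], [1.5, 0.5], 0))
```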