TACO: Training-free Sound Prompted Segmentation via Deep Audio-visual CO-factorization

Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid
2024-12-02
Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.
Audio and Speech Processing, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
The problem this paper addresses is **sound-prompted segmentation**: segmenting the image regions that correspond to objects heard in an audio signal, without training a model specifically for the task.

### Problem Background

Most existing methods address this problem by fine-tuning pre-trained models or by training additional modules. These methods rely on large amounts of training data and complex architecture designs to align audio and visual features. This approach not only requires substantial computing resources but may also overfit the model to specific tasks, reducing its generalization ability.

### Main Contributions of the Paper

To address these issues, the authors propose a method named TACO (Training-free Audio-visual CO-factorization). The core idea of TACO is to use **Non-negative Matrix Factorization (NMF)** to jointly decompose audio and visual features, revealing interpretable concepts shared by the two modalities. In this way, TACO achieves high-precision sound-prompted image segmentation without any task-specific training.

### Main Technical Features

1. **Training-free Paradigm**: TACO avoids task-specific training entirely and relies only on pre-trained audio and image models (such as CLIP and CLAP), retaining the strong generalization ability of these models.
2. **Interpretability**: Through the NMF framework, TACO makes it possible to visualize the relationship between the segmentation output and the concepts identified in the signal, providing semantically interpretable results.
3. **High Performance**: Experimental results show that TACO significantly outperforms existing unsupervised methods on multiple benchmark datasets.

### Method Overview

The workflow of TACO is as follows:

- **Input Processing**: Encode the audio and the image into feature matrices \(X_A\) and \(X_I\), respectively.
- **Joint Decomposition**: Decompose \(X_A\) and \(X_I\) into activation matrices \(U_A\), \(U_I\) and concept matrices \(V_A\), \(V_I\) through soft co-NMF.
- **Semantic Matching**: Introduce semantic anchors to ensure the semantic consistency of audio and visual features.
- **Segmentation Generation**: Use the factor matrix \(V_I\) obtained from the decomposition to prompt a pre-trained open-vocabulary segmentation model (such as FC-CLIP), generating an accurate segmentation map.

### Experimental Verification

The authors verify the effectiveness of TACO through extensive experiments, including quantitative and qualitative analyses. The results show that TACO achieves superior performance on multiple datasets, especially in single-source and multi-source segmentation tasks.

### Summary

TACO is a training-free method for sound-prompted image segmentation. It avoids a complex training process and, through the NMF framework, provides both interpretability and high performance, opening a new research direction in audio-visual perception.
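
### Illustrative Sketch of the Joint Decomposition

To make the joint decomposition step from the method overview more concrete, the following is a minimal sketch of one possible soft co-NMF. This summary does not state the exact objective used in the paper, so the formulation here is an assumption: it minimizes \(\|X_A - U_A V_A\|_F^2 + \|X_I - U_I V_I\|_F^2 + \lambda \|V_A - V_I\|_F^2\) with standard multiplicative updates, where the last term softly ties the two concept matrices together. The function name `soft_co_nmf`, the weight `lam`, and the synthetic non-negative features (standing in for real encoder outputs) are all hypothetical, and the semantic anchors are not modeled.

```python
import numpy as np


def co_nmf_loss(X_A, X_I, U_A, V_A, U_I, V_I, lam):
    """Assumed objective: two reconstruction errors plus a soft coupling term."""
    rec_a = np.sum((X_A - U_A @ V_A) ** 2)
    rec_i = np.sum((X_I - U_I @ V_I) ** 2)
    coupling = lam * np.sum((V_A - V_I) ** 2)
    return rec_a + rec_i + coupling


def soft_co_nmf(X_A, X_I, k, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical soft co-NMF using multiplicative updates.

    X_A: (n_A, d) non-negative audio features.
    X_I: (n_I, d) non-negative image features (same feature dimension d assumed).
    Returns activations U_A (n_A, k), U_I (n_I, k), concepts V_A, V_I (k, d),
    and the loss value recorded before training and after every iteration.
    """
    if X_A.shape[1] != X_I.shape[1]:
        raise ValueError("audio and image features must share the feature dimension")
    if (X_A < 0).any() or (X_I < 0).any():
        raise ValueError("NMF requires non-negative inputs")

    rng = np.random.default_rng(seed)
    d = X_A.shape[1]
    U_A = rng.random((X_A.shape[0], k)) + eps
    U_I = rng.random((X_I.shape[0], k)) + eps
    V_A = rng.random((k, d)) + eps
    V_I = rng.random((k, d)) + eps

    history = [co_nmf_loss(X_A, X_I, U_A, V_A, U_I, V_I, lam)]
    for _ in range(n_iter):
        # Activation updates are the usual Frobenius-norm NMF rules.
        U_A *= (X_A @ V_A.T) / (U_A @ (V_A @ V_A.T) + eps)
        # Concept updates add the coupling term: lam * V_I pulls V_A toward V_I.
        V_A *= (U_A.T @ X_A + lam * V_I) / (U_A.T @ U_A @ V_A + lam * V_A + eps)
        U_I *= (X_I @ V_I.T) / (U_I @ (V_I @ V_I.T) + eps)
        V_I *= (U_I.T @ X_I + lam * V_A) / (U_I.T @ U_I @ V_I + lam * V_I + eps)
        history.append(co_nmf_loss(X_A, X_I, U_A, V_A, U_I, V_I, lam))
    return U_A, V_A, U_I, V_I, history


# Synthetic example: both modalities are generated from the same 3 concepts.
demo_rng = np.random.default_rng(42)
V_true = demo_rng.random((3, 16))
X_A = demo_rng.random((10, 3)) @ V_true   # 10 audio frames, 16-dim features
X_I = demo_rng.random((49, 3)) @ V_true   # 49 image patches (7x7 grid), 16-dim features

U_A, V_A, U_I, V_I, history = soft_co_nmf(X_A, X_I, k=3)
print(f"loss before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

In this sketch, each column of `U_I` can be reshaped to the 7x7 patch grid to inspect where a concept is active in the image, which is one way to read the interpretability claim above. Setting `lam=0` decouples the two factorizations, so the weight controls how strongly the audio and visual concepts are encouraged to agree.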