Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models to reveal shared interpretable concepts. These concepts are passed to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.

What problem does this paper attempt to address?

This paper addresses **sound-prompted segmentation**: given an audio signal, segment the image regions that correspond to the objects heard in it, without training a model specifically for the task.

### Problem Background

Most existing methods solve this problem by fine-tuning pre-trained models or by training additional modules. They rely on large amounts of training data and complex architecture designs to align audio and visual features. These approaches not only require substantial computing resources but may also overfit to the specific task, which reduces generalization.

### Main Contributions of the Paper

To address these issues, the authors propose TACO (Training-free Audio-visual CO-factorization). The core idea of TACO is to use **Non-negative Matrix Factorization (NMF)** to jointly decompose audio and visual features, revealing interpretable concepts shared by the two modalities. In this way, TACO performs sound-prompted image segmentation without any task-specific training.

### Main Technical Features

1. **Training-free Paradigm**: TACO avoids task-specific training entirely and relies only on pre-trained audio and image models (such as CLIP and CLAP), thus retaining the strong generalization ability of these models.
2. **Interpretability**: Through the NMF framework, TACO can visualize the relationship between the segmentation output and the concepts identified in the signal, providing semantically interpretable results.
3. **High Performance**: Experimental results show that TACO significantly outperforms existing unsupervised methods on multiple benchmark datasets.

### Method Overview

The workflow of TACO is as follows:

- **Input Processing**: Encode the audio and the image into feature matrices \(X_A\) and \(X_I\), respectively.
- **Joint Decomposition**: Decompose \(X_A\) and \(X_I\) into activation matrices \(U_A\), \(U_I\) and concept matrices \(V_A\), \(V_I\) through soft co-NMF.
- **Semantic Matching**: Introduce semantic anchors to ensure the semantic consistency of audio and visual features.
- **Segmentation Generation**: Use the factor matrix \(V_I\) obtained from the decomposition to prompt a pre-trained open-vocabulary segmentation model (such as FC-CLIP), thereby generating the segmentation map.

### Experimental Verification

The authors verify the effectiveness of TACO through quantitative and qualitative experiments. The results show that TACO outperforms previous unsupervised methods on multiple datasets, in both single-source and multi-source segmentation tasks.

### Summary

TACO is a training-free method for sound-prompted image segmentation. It avoids task-specific training while providing interpretability through the NMF framework and state-of-the-art performance among unsupervised methods.
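To make the Joint Decomposition step of the workflow more concrete, here is a minimal NumPy sketch of a soft co-NMF. It is not the authors' implementation. It assumes that both feature matrices are non-negative and share the same feature dimension, and that the coupling is a soft penalty on the concept matrices, i.e. it minimizes \(\|X_A - U_A V_A\|_F^2 + \|X_I - U_I V_I\|_F^2 + \lambda \|V_A - V_I\|_F^2\) with standard multiplicative updates. The function name `soft_co_nmf` and the parameters `k`, `lam`, `n_iter` are hypothetical, and the semantic anchors and the segmentation model are left out.

```python
import numpy as np


def soft_co_nmf(X_A, X_I, k=4, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Illustrative soft co-NMF (assumed formulation, not the paper's code).

    X_A: (T, d) non-negative audio features, X_I: (N, d) non-negative image
    features. Returns activations U_A (T, k), U_I (N, k), concepts V_A, V_I
    (k, d), and the objective value recorded after each iteration.
    """
    rng = np.random.default_rng(seed)
    (t, d), (n, d_i) = X_A.shape, X_I.shape
    assert d == d_i, "this sketch assumes a shared feature dimension"

    # Random non-negative initialization of all four factors.
    U_A, U_I = rng.random((t, k)), rng.random((n, k))
    V_A, V_I = rng.random((k, d)), rng.random((k, d))

    history = []
    for _ in range(n_iter):
        # Activation updates: plain NMF multiplicative rule per modality.
        U_A *= (X_A @ V_A.T) / (U_A @ (V_A @ V_A.T) + eps)
        U_I *= (X_I @ V_I.T) / (U_I @ (V_I @ V_I.T) + eps)

        # Concept updates: the penalty pulls V_A and V_I toward each other.
        V_A *= (U_A.T @ X_A + lam * V_I) / (U_A.T @ U_A @ V_A + lam * V_A + eps)
        V_I *= (U_I.T @ X_I + lam * V_A) / (U_I.T @ U_I @ V_I + lam * V_I + eps)

        loss = (
            np.linalg.norm(X_A - U_A @ V_A) ** 2
            + np.linalg.norm(X_I - U_I @ V_I) ** 2
            + lam * np.linalg.norm(V_A - V_I) ** 2
        )
        history.append(loss)
    return U_A, V_A, U_I, V_I, history


if __name__ == "__main__":
    # Toy data: both modalities are generated from the same 3 concepts.
    rng = np.random.default_rng(1)
    V_true = rng.random((3, 16))
    X_A = rng.random((10, 3)) @ V_true   # e.g. 10 audio frames
    X_I = rng.random((25, 3)) @ V_true   # e.g. 25 image patches
    U_A, V_A, U_I, V_I, history = soft_co_nmf(X_A, X_I, k=3)
    print("first / last objective:", history[0], history[-1])
```

In this sketch, a larger `lam` forces the audio and image concepts closer together, while `lam = 0` reduces to two independent NMF problems. Each row of `U_I` gives the per-concept activation of one image patch, which is the kind of quantity one could reshape into a spatial map for inspection.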

TACO: Training-free Sound Prompted Segmentation via Deep Audio-visual CO-factorization

Self-supervised Audio-visual Co-segmentation

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

Object Segmentation with Audio Context

Self-Supervised Segmentation and Source Separation on Videos

Weakly-supervised Audio-visual Sound Source Detection and Separation

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

Leveraging Foundation models for Unsupervised Audio-Visual Segmentation

Audio-Visual Model Distillation Using Acoustic Images

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Audio-Visual Segmentation

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Audio-Visual Segmentation with Semantics

AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation

Curriculum Audiovisual Learning

Visually Guided Sound Source Separation Using Cascaded Opponent Filter Network

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing