Abstract: Semantic segmentation of images [11, 3] and sound source separation in audio [8, 4, 1] are two important and popular tasks in the computer vision and computational audition communities. Traditional approaches have relied on large, labeled datasets, but recent work has leveraged the natural correspondence between vision and sound to apply supervised learning without explicit labels. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [9]. This paper is a workshop edit of Rouditchenko et al. 2019 [5]. In the Mix-and-Separate framework proposed in [9], neural networks are trained on videos through self-supervision to perform image segmentation and sound source separation. However, following training, the model could only be applied to videos with synchronized audio, limiting its use in real applications where synchronized data are not available. Here we seek to enable a system that can perform segmentation and separation tasks using test input containing only video frames or only sound mixtures. We introduce a learning approach that disentangles concepts learned by neural networks, enabling independent inference on images and audio mixtures without needing to combine visual and auditory features. We evaluate performance on image-only and audio-only tasks, which was not possible using the previous model. Furthermore, we substantially extend the scale of previous work [9] by training on a video dataset of naturally occurring audio-visual events with 28 event categories and over 4000 videos [6]. The results show that we can achieve promising semantic segmentation and sound source separation performance.
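The self-supervised signal behind a mix-and-separate style of training can be illustrated with a small sketch. It is not the authors' implementation and rests on several assumptions: the function names are hypothetical, each source is represented by a toy magnitude spectrogram (a nested list), magnitudes are treated as additive when two sources are mixed, and the training targets are ratio masks. In a real system a network would predict such masks from the mixture (and, in [9], from visual features); here only the construction of the synthetic mixture and its targets is shown.

```python
import random


def mix_and_separate_targets(spec_a, spec_b, eps=1e-8):
    """Build a synthetic mixture and ratio-mask targets from two sources.

    spec_a, spec_b: equally sized nested lists of non-negative magnitudes
    (frequency bins x time frames). Assumption: magnitudes add when mixed.
    """
    mixture = [[a + b for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(spec_a, spec_b)]
    # Ratio mask: fraction of the mixture energy that belongs to each source.
    mask_a = [[a / (m + eps) for a, m in zip(row_a, row_m)]
              for row_a, row_m in zip(spec_a, mixture)]
    mask_b = [[b / (m + eps) for b, m in zip(row_b, row_m)]
              for row_b, row_m in zip(spec_b, mixture)]
    return mixture, mask_a, mask_b


def apply_mask(mask, mixture):
    """Recover a source estimate by element-wise masking of the mixture."""
    return [[k * m for k, m in zip(row_k, row_m)]
            for row_k, row_m in zip(mask, mixture)]


# Toy data standing in for spectrograms taken from two different videos.
rng = random.Random(0)
n_freq, n_time = 8, 5
spec_a = [[rng.random() for _ in range(n_time)] for _ in range(n_freq)]
spec_b = [[rng.random() for _ in range(n_time)] for _ in range(n_freq)]

mixture, mask_a, mask_b = mix_and_separate_targets(spec_a, spec_b)
recovered_a = apply_mask(mask_a, mixture)
recovered_b = apply_mask(mask_b, mixture)
```

Because the mixture is created artificially from sources that are known individually, the targets come for free and no manual labels are needed, which is the sense in which such training is self-supervised.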

Video-Guided Sound Source Separation

Co-Separating Sounds of Visual Objects

Learning to Separate Object Sounds by Watching Unlabeled Video

Visually Guided Sound Source Separation Using Cascaded Opponent Filter Network

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

Self-Supervised Learning for Alignment of Objects and Sound

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Weakly-supervised Audio-visual Sound Source Detection and Separation

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Semantic Grouping Network for Audio Source Separation

High-Quality Visually-Guided Sound Separation from Diverse Categories

Leveraging Category Information for Single-Frame Visual Sound Source Separation

Self-Supervised Segmentation and Source Separation on Videos

Listen and Look: Audio–Visual Matching Assisted Speech Source Separation

Acoustic Source Localization and Deconvolution-Based Separation

Visual Scene Graphs for Audio Source Separation

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

Multiple Sound Sources Localization from Coarse to Fine

Separate Anything You Describe