Abstract: Semantic segmentation of images [11, 3] and sound source separation in audio [8, 4, 1] are two important and popular tasks in the computer vision and computational audition communities. Traditional approaches have relied on large, labeled datasets, but recent work has leveraged the natural correspondence between vision and sound to apply supervised learning without explicit labels. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [9]. This paper is a workshop edit of Rouditchenko et al. 2019 [5]. In the Mix-and-Separate framework proposed in [9], neural networks are trained on videos through self-supervision to perform image segmentation and sound source separation. However, after training, the model could only be applied to videos with synchronized audio, limiting its use in real applications where synchronized data are not available. Here we seek to enable a system that can perform segmentation and separation tasks using test input containing only video frames or only sound mixtures. We introduce a learning approach that disentangles concepts learned by neural networks, enabling independent inference on images and audio mixtures without needing to combine visual and auditory features. We evaluate performance on image-only and audio-only tasks, which was not possible with the previous model. Furthermore, we substantially extend the scale of previous work [9] by training on a video dataset of naturally occurring audio-visual events with 28 event categories and over 4000 videos [6]. The results show that we can achieve promising semantic segmentation and sound source separation performance.

Self-Supervised Learning for Alignment of Objects and Sound

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Multiple Sound Sources Localization from Coarse to Fine

Learning to Separate Object Sounds by Watching Unlabeled Video

Self-supervised object detection from audio-visual correspondence

Co-Separating Sounds of Visual Objects

Active Object Discovery and Localization Using Sound-Induced Attention

Video-Guided Sound Source Separation

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Weakly-supervised Audio-visual Sound Source Detection and Separation

Self-supervised Audio-visual Co-segmentation

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Language-Guided Audio-Visual Source Separation via Trimodal Consistency

Self-Supervised Segmentation and Source Separation on Videos

Sound-Indicated Visual Object Detection for Robotic Exploration

Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

Self-supervised Moving Vehicle Tracking with Stereo Sound

Enhancing Sound Source Localization via False Negative Elimination

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Sound Localization by Self-Supervised Time Delay Estimation