Self-Supervised Segmentation and Source Separation on Videos.

Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh H. McDermott, Antonio Torralba
2019-01-01
Abstract: Semantic segmentation of images [11, 3] and sound source separation in audio [8, 4, 1] are two important and popular tasks in the computer vision and computational audition communities. Traditional approaches have relied on large, labeled datasets, but recent work has leveraged the natural correspondence between vision and sound to apply supervised learning without explicit labels. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [9]. This paper is a workshop edit of Rouditchenko et al. 2019 [5]. In the Mix-and-Separate framework proposed in [9], neural networks are trained on videos through self-supervision to perform image segmentation and sound source separation. However, following training, the model could only be applied to videos with synchronized audio, limiting its use in real applications where synchronized data are not available. Here we seek to enable a system that can perform segmentation and separation tasks using test input containing only video frames or only sound mixtures. We introduce a learning approach that disentangles concepts learned by neural networks, enabling independent inference on images and audio mixtures without needing to combine visual and auditory features. We evaluate performance on image-only and audio-only tasks, which was not possible using the previous model. Furthermore, we substantially extend the scale of previous work [9] by training on a video dataset of naturally occurring audio-visual events with 28 event categories and over 4000 videos [6]. The results show that we can achieve promising semantic segmentation and sound source separation performance.