V-SlowFast Network for Efficient Visual Sound Separation

Lingyu Zhu, Esa Rahtu
DOI: https://doi.org/10.48550/arXiv.2109.08867
2021-09-21
Abstract: The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves 74.2% reduction in the number of model parameters and 81.4% reduction in GMACs compared to the previous multi-stage models. Project page: https://ly-zhu.github.io/V-SlowFast
Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses vision-guided sound separation. Specifically, its objectives are:

1. **Study visual sound separation on spectrograms of different temporal resolutions**: Examine how separating sounds on spectrograms with different temporal resolutions captures different sound characteristics.
2. **Propose a new three-stream framework, V-SlowFast**: The framework operates on a visual frame, a Slow spectrogram, and a Fast spectrogram. The Slow spectrogram captures coarse temporal resolution, while the Fast spectrogram contains fine-grained temporal resolution.
3. **Introduce contrastive learning objectives**: Two contrastive objectives encourage the network to learn discriminative visual features for separating sounds.
4. **Propose an audio-visual global attention module**: This module fuses audio and visual features, enabling the model to focus on the target sound source.
5. **Improve performance on small- and large-scale datasets**: On MUSIC-21, AVE, and VGG-Sound, V-SlowFast outperforms the previous state of the art in single-frame-based visual sound separation.
6. **Reduce model parameters and computational cost**: A small V-SlowFast variant achieves a 74.2% reduction in the number of model parameters and an 81.4% reduction in GMACs compared to previous multi-stage models.

Through these objectives, the paper aims to provide an efficient and lightweight method for vision-guided sound separation.
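To make the Slow/Fast distinction concrete, the following is a minimal sketch of one way to obtain two spectrograms of the same audio at different temporal resolutions, by varying the STFT hop size. It is an illustration under stated assumptions, not the authors' implementation: the sample rate, window length, hop sizes, the synthetic test signal, and the helper name `magnitude_spectrogram` are all hypothetical and do not come from the paper.

```python
import numpy as np


def magnitude_spectrogram(signal, n_fft, hop):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    Hypothetical helper for illustration only.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Assumed settings: 2 seconds of audio at 11025 Hz (not taken from the paper).
sample_rate = 11025
t = np.arange(sample_rate * 2) / sample_rate
# Synthetic stand-in for a sound mixture: two sinusoids.
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# A large hop gives few time frames (coarse temporal resolution);
# a small hop gives many time frames (fine-grained temporal resolution).
slow_spec = magnitude_spectrogram(mixture, n_fft=1024, hop=512)
fast_spec = magnitude_spectrogram(mixture, n_fft=1024, hop=128)

print("slow:", slow_spec.shape)
print("fast:", fast_spec.shape)
```

Both arrays share the same frequency axis, while the fast spectrogram has roughly four times as many time frames as the slow one. In a three-stream design of the kind summarized above, each spectrogram would feed its own audio branch alongside the visual frame; how the paper actually constructs the two resolutions may differ from this hop-size choice.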