V-SlowFast Network for Efficient Visual Sound Separation

Lingyu Zhu, Esa Rahtu

DOI: https://doi.org/10.48550/arXiv.2109.08867

2021-09-21

Abstract: The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new light yet efficient three-stream framework V-SlowFast that operates on Visual frame, Slow spectrogram, and Fast spectrogram. The Slow spectrogram captures the coarse temporal resolution while the Fast spectrogram contains the fine-grained temporal resolution; iii) we introduce two contrastive objectives to encourage the network to learn discriminative visual features for separating sounds; iv) we propose an audio-visual global attention module for audio and visual feature fusion; v) the introduced V-SlowFast model outperforms previous state-of-the-art in single-frame based visual sound separation on small- and large-scale datasets: MUSIC-21, AVE, and VGG-Sound. We also propose a small V-SlowFast architecture variant, which achieves 74.2% reduction in the number of model parameters and 81.4% reduction in GMACs compared to the previous multi-stage models. Project page: https://ly-zhu.github.io/V-SlowFast

Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses visually guided sound separation. Specifically, its objectives are:

1. **Study visual sound separation on spectrograms of different temporal resolutions**: Examine how separation behaves when the audio is represented at different temporal resolutions, each of which captures different sound characteristics.
2. **Propose a new three-stream framework, V-SlowFast**: The framework operates on a visual frame, a Slow spectrogram, and a Fast spectrogram. The Slow spectrogram captures coarse temporal resolution, while the Fast spectrogram contains fine-grained temporal resolution.
3. **Introduce contrastive objectives**: Two contrastive objectives encourage the network to learn discriminative visual features for separating sounds.
4. **Propose an audio-visual global attention module**: This module fuses audio and visual features, enabling the model to focus on the target sound source.
5. **Improve performance on small- and large-scale datasets**: On MUSIC-21, AVE, and VGG-Sound, the V-SlowFast model outperforms the previous state of the art in single-frame-based visual sound separation.
6. **Reduce model parameters and computational cost**: A small V-SlowFast architecture variant achieves a 74.2% reduction in the number of model parameters and an 81.4% reduction in GMACs compared to previous multi-stage models.

Together, these objectives aim to provide an efficient and lightweight method for visually guided sound separation.
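To make the Slow/Fast distinction concrete, the sketch below computes two magnitude spectrograms of the same waveform that differ only in hop length, so one has few time frames (coarse) and the other has many (fine). This is an illustration under stated assumptions, not the authors' pipeline: the sample rate, window size, hop lengths, and the helper name `magnitude_spectrogram` are all hypothetical, and the summary above does not say how the paper actually derives its two spectrograms.

```python
import numpy as np


def magnitude_spectrogram(wave, n_fft, hop):
    """Magnitude STFT with a Hann window; returns an array of shape (freq_bins, frames).

    Hypothetical helper for illustration only.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Real FFT along each frame, then transpose to (freq_bins, frames).
    return np.abs(np.fft.rfft(frames, axis=1)).T


# Assumed settings: 2 seconds of a 440 Hz tone at 11025 Hz.
sample_rate = 11025
t = np.arange(2 * sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

# Same window, different hop: a large hop gives coarse temporal resolution
# ("slow"), a small hop gives fine temporal resolution ("fast").
slow_spec = magnitude_spectrogram(wave, n_fft=1022, hop=512)
fast_spec = magnitude_spectrogram(wave, n_fft=1022, hop=128)

print("slow:", slow_spec.shape)
print("fast:", fast_spec.shape)
```

Both arrays share the same number of frequency bins, while the fast spectrogram has roughly four times as many time frames as the slow one; a three-stream model would consume these alongside the visual frame.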
