Abstract: We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning model parameters and expanding the training set. Moreover, we highlight the benefits a stereo model brings by introducing a new metric which detects attenuation inconsistencies between channels. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial, confirming the effectiveness of our techniques in stringent listening tests.

What problem does this paper attempt to address?

The paper addresses stereo singing voice cancellation (SVC), a subtask of music source separation whose goal is to estimate the instrumental accompaniment from a stereo mix, i.e., to remove the vocals. Specifically, the authors explore how a small, efficient model originally designed for real-time speech separation can approach the performance of large state-of-the-art source separation networks when memory, compute, and look-ahead are limited. The key contributions are:

1. **Model adaptation for stereo input and output**: The authors modify the Conv-TasNet architecture to accept stereo input and produce stereo output. This design improves the consistency of vocal attenuation between the left and right channels, an improvement validated with a new stereo metric.
2. **Stereo separation asymmetry measure**: The paper proposes a method for measuring stereo artifacts, called the stereo separation asymmetry measure, and shows that the proposed stereo architecture helps reduce such artifacts.
3. **Experimental evaluation**: The proposed methods are validated through both objective and subjective evaluation. The objective evaluation uses metrics such as the Scale-Invariant Source-to-Distortion Ratio (SI-SDR); the subjective evaluation uses a large-scale MUSHRA listening test to confirm the effectiveness of the techniques.

The results indicate that, with appropriate training, a relatively small, purpose-optimized model can produce output quality comparable to that of large models under resource constraints. The experiments also highlight how much both the quality and the quantity of training data matter for model performance.
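The summary above names SI-SDR as one of the objective metrics. As a rough illustration, the sketch below computes SI-SDR using its commonly cited definition (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function names `si_sdr` and `si_sdr_per_channel`, the `eps` stabiliser, and the `(channels, samples)` array layout are assumptions made for this example; the paper's exact evaluation code is not given here, and this is not the paper's stereo separation asymmetry measure.

```python
import numpy as np


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (illustrative sketch)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()

    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference      # part of the estimate aligned with the reference
    residual = estimate - target    # everything else counts as distortion

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_per_channel(reference, estimate):
    """Apply si_sdr to each channel of arrays shaped (channels, samples)."""
    reference = np.atleast_2d(reference)
    estimate = np.atleast_2d(estimate)
    return [si_sdr(r, e) for r, e in zip(reference, estimate)]
```

For a stereo accompaniment estimate, calling `si_sdr_per_channel(reference, estimate)` returns one value for the left channel and one for the right. A large gap between the two is only a crude hint that the channels were processed inconsistently; the asymmetry measure proposed in the paper is a separate metric whose definition is not reproduced in this summary.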

Resource-constrained stereo singing voice cancellation

A Deep-Learning Based Framework for Source Separation, Analysis, and Synthesis of Choral Ensembles

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music

Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

Zero-Shot Duet Singing Voices Separation with Diffusion Models

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Neural Vocoder Feature Estimation for Dry Singing Voice Separation

Experiments on Blind Speech Separations

Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Deep Learning Based Source Separation Applied To Choir Ensembles

Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation

Audiovisual Singing Voice Separation

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion

A Distinct Synthesizer Convolutional Tasnet For Singing Voice Separation

Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

Single-Channel Blind Source Separation for Singing Voice Detection: A Comparative Study

A fully differentiable model for unsupervised singing voice separation

A Two-stage Single-channel Speaker-dependent Speech Separation Approach for Chime-5 Challenge

Improving Choral Music Separation through Expressive Synthesized Data from Sampled Instruments