Resource-constrained stereo singing voice cancellation

Clara Borrelli, James Rae, Dogac Basaran, Matt McVicar, Mehrez Souden, Matthias Mauch
2024-01-23
Abstract: We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning model parameters and expanding the training set. Moreover, we highlight the benefits a stereo model brings by introducing a new metric which detects attenuation inconsistencies between channels. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial, confirming the effectiveness of our techniques in stringent listening tests.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses stereo singing voice cancellation (SVC), a subtask of music source separation whose goal is to estimate the instrumental accompaniment from a stereo mix, i.e., to remove the vocals. Specifically, the authors explore how a small, efficient model designed for real-time speech separation can approach the performance of large state-of-the-art source separation networks when memory, compute, and look-ahead are limited. The paper presents the following key contributions:

1. **Model adaptation for stereo input and output**: The authors modify the Conv-TasNet architecture to accept stereo input and produce stereo output. This design makes vocal attenuation more consistent between the left and right channels.
2. **Stereo separation asymmetry measure**: The authors propose a new metric, the stereo separation asymmetry measure, which detects attenuation inconsistencies between channels. They use it to show that the stereo architecture reduces such artifacts.
3. **Experimental evaluation**: The proposed methods are validated through both objective and subjective evaluation. Objective evaluation uses metrics such as the Scale-Invariant Source-to-Distortion Ratio (SI-SDR); subjective evaluation uses a large-scale MUSHRA listening test to confirm the effectiveness of the techniques.

The results indicate that, with appropriate training, a relatively small and carefully tuned model can produce output comparable in quality to that of large models under resource-constrained conditions. The experiments also show that the quality and quantity of training data matter for model performance.
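To make the SI-SDR metric mentioned above concrete, here is a minimal sketch of its commonly used definition: the reference is rescaled by the factor that best matches the estimate, and the remaining residual is treated as distortion. The function names `si_sdr` and `stereo_si_sdr` are illustrative, and averaging the two channel scores is an assumption made for this example, not necessarily the evaluation protocol used in the paper.

```python
import math
from typing import Sequence, Tuple


def si_sdr(estimate: Sequence[float], reference: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB for a single channel (common definition)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Optimal scaling of the reference: alpha = <estimate, reference> / ||reference||^2
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)
    # Scaled target, and the residual that counts as distortion
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def stereo_si_sdr(est_left: Sequence[float], est_right: Sequence[float],
                  ref_left: Sequence[float], ref_right: Sequence[float]
                  ) -> Tuple[float, float, float]:
    """Per-channel SI-SDR and their mean (the averaging is an assumption)."""
    left = si_sdr(est_left, ref_left)
    right = si_sdr(est_right, ref_right)
    return left, right, 0.5 * (left + right)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged. Reporting the left and right scores separately also exposes cases where one channel is separated noticeably worse than the other.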