Abstract: Monaural source separation (MSS) aims to extract and reconstruct individual sources from a single-channel mixture, which can facilitate a variety of applications such as chord recognition, pitch estimation, and automatic transcription. In this paper, we study the problem of separating vocals and instruments from a monaural music mixture. Existing approaches to monaural source separation either rely on linear, shallow models (e.g., non-negative matrix factorization) or do not explicitly address the coupling and entanglement of multiple sources in the original input signal, and therefore do not perform satisfactorily in real-world scenarios. To overcome these limitations, we propose a novel end-to-end framework for monaural music mixture separation called Deep Representation-Decoupling Neural Networks (DRDNN). DRDNN takes advantage of both traditional signal processing methods and popular deep learning models. For each input music mixture, DRDNN converts the signal to a two-dimensional time-frequency spectrogram using the short-time Fourier transform (STFT), then applies stacked convolutional neural network (CNN) layers and long short-term memory (LSTM) layers to extract more condensed features. Afterwards, DRDNN uses a decoupling component, which consists of a group of multi-layer perceptrons (MLPs), to further decouple the features into separate sources. The decoupling component produces purified single-source signals for subsequent full-size restoration and can significantly improve the final separation performance. Through extensive experiments on a real-world dataset, we show that DRDNN outperforms state-of-the-art baselines on the task of monaural music mixture separation and reconstruction.
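As a rough illustration of the first stage described in the abstract, the sketch below converts a mono signal into a time-frequency magnitude spectrogram with a plain STFT. It is not the authors' implementation: the frame length, hop size, Hann window, the toy two-tone "mixture", and the helper names `hann` and `stft_magnitude` are all assumptions made for this example, since the abstract does not give these details. The later stages (CNN, LSTM, and the MLP decoupling component) are not sketched here.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return a list of frames, each holding frame_len // 2 + 1 magnitudes.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window the current frame before transforming it.
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT for bin k; slow but dependency-free.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Toy single-channel "mixture": two sinusoids standing in for two sources.
sr = 1024  # assumed sample rate for the toy signal
mixture = [
    math.sin(2 * math.pi * 64 * t / sr) + 0.5 * math.sin(2 * math.pi * 192 * t / sr)
    for t in range(sr)
]
spec = stft_magnitude(mixture)
print(len(spec), "frames x", len(spec[0]), "frequency bins")
```

The resulting two-dimensional array (time frames by frequency bins) is the kind of input that the stacked CNN and LSTM layers would consume; in practice a library FFT would replace the naive DFT used here.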

Monophonic Singing Voice Separation Based on Deep Learning

Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation

Deep Representation-Decoupling Neural Networks for Monaural Music Mixture Separation

Hybrid Y-Net Architecture for Singing Voice Separation

Deep Learning Based Speech Separation Via NMF-Style Reconstructions

Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

A Distinct Synthesizer Convolutional Tasnet For Singing Voice Separation

Voice and accompaniment separation in music using self-attention convolutional neural network

Deep Clustering and Conventional Networks for Music Separation: Stronger Together

Gen-Res-Net: A Novel Generative Model for Singing Voice Separation

Deep Learning Based Source Separation Applied To Choir Ensembles

Combining HMM-based melody extraction and NMF-based soft masking for separating voice and accompaniment from monaural audio

TF-Attention-Net: an End to End Neural Network for Singing Voice Separation

Research On Singing Voice Detection Based On A Long-Term Recurrent Convolutional Network With Vocal Separation And Temporal Smoothing

Multi-Stage Non-Negative Matrix Factorization for Monaural Singing Voice Separation

Towards Solving The Bottleneck Of Pitch-Based Singing Voice Separation

Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

Audiovisual Singing Voice Separation

Singer separation for karaoke content generation