Gen-Res-Net: A Novel Generative Model for Singing Voice Separation

Congzhou Tian,Hangyu Li,Deshun Yang,Xiaoou Chen
DOI: https://doi.org/10.1007/978-3-030-37731-1_3
2020-01-01
Abstract: In most cases, modeling in the time-frequency domain is the most common method to solve the problem of singing voice separation since frequency characteristics differ between different sources. During the past few years, applying recurrent neural network (RNN) to series of split spectrograms has been mostly adopted by researchers to tackle this problem. Recently, however, the U-net’s success has drawn the focus to treating the spectrogram as a 2-dimensional image with an auto-encoder structure, which indicates that some useful methods in image analysis may help solve this problem. Under this scenario, we propose a novel spectrogram-generative model to separate the two sources in the time-frequency domain inspired by Residual blocks, Squeeze and Excitation blocks and WaveNet. We apply none-reduce-sized Residual blocks together with Squeeze and Excitation blocks in the main stream to extract features of the input spectrogram while gathering the output layers in a skip-connection structure used in WaveNet. Experimental results on two datasets (MUSDB18 and CCMixer) have shown that our proposed network performs better than the current state-of-the-art approach working on spectrograms of mixtures – the deep U-net structure.
What problem does this paper attempt to address?