A Distinct Synthesizer Convolutional Tasnet For Singing Voice Separation

Congzhou Tian,Deshun Yang,Xiaoou Chen
DOI: https://doi.org/10.1007/978-3-030-37731-1_4
2020-01-01
Abstract:Deep learning methods have already been used for music source separation for several years and proved to be very effective. Most of them choose Fourier Transform as the front-end process to get a spectrogram representation, which has its drawback though. Perhaps the spectrogram representation is just suitable for human to understand sounds, but not the best representation used by powerful neural networks for singing voice separation. TasNet (Time Audio Separation Network) has been proposed recently to solve monaural speech separation in the time domain by modeling each source as a weighted sum of a common set of basis signals. Then the fully-convolutional TasNet raised recently achieves great improvements in speech separation. In this paper, we first show convolutional TasNet can also be used in singing voice separation and bring about improvements on the dataset DSD100 in the singing voice separation task. Then based on the fact that in singing voice separation, the difference between the singing voice and the accompaniment is far more remarkable than the difference between the voices of two different people in speech separation, we employ separate sets of basis signals and separate encoder outputs for the singing voice and the accompaniment respectively, which makes a further improved model, distinct synthesizer convolutional TasNet (ds-cTasNet).
What problem does this paper attempt to address?