Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation

Xuechao Wu, Dongmei Li, Chao Ma, Xupeng Jia
DOI: https://doi.org/10.1109/icsip49896.2020.9339433
2020-01-01
Abstract: Speech separation is a core problem in audio signal processing and a key pre-processing step for automatic speech recognition. The magnitude spectrogram is reported as the standard time-frequency representation of speech signals. Approaches such as time-frequency (T-F) mask estimation and mapping estimation have been proposed to estimate clean speech from the magnitude spectrogram. Recently, this sequential task has made great progress thanks to time-domain mask estimation and dilated temporal convolutional networks (TCN), as used in Conv-TasNet. In this work, we propose a new monaural speech separation framework that integrates these directions. We explore a time-domain mapping-based algorithm that directly estimates clean speech features in an end-to-end system. We also make use of an optimal scale-invariant signal-to-distortion ratio (OSI-SDR) loss function. We evaluate this framework on a recently released noisy speech separation dataset (WHAM) and obtain encouraging results in preliminary experiments. Finally, we show that a learned 1-D convolutional encoder works well as a feature extractor compared with other encoders.
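For readers unfamiliar with the loss family mentioned in the abstract, the following is a minimal sketch of the standard scale-invariant signal-to-distortion ratio (SI-SDR) in NumPy. It is not the OSI-SDR variant proposed in the paper, whose definition the abstract does not give. The function names `si_sdr` and `si_sdr_loss` and the `eps` stabiliser are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Standard SI-SDR in dB for two 1-D signals of equal length.

    Illustrative only: this is the common SI-SDR definition,
    not the paper's OSI-SDR.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target: this is the optimally scaled target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target

    # Everything in the estimate that is not explained by the target.
    residual = estimate - scaled_target

    ratio = (np.dot(scaled_target, scaled_target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimising the loss maximises SI-SDR."""
    return -si_sdr(estimate, target)
```

Because the target is rescaled by the projection coefficient, multiplying the estimate by any non-zero constant leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.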