Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Ye Bai, Chenxing Li, Hao Li, Yuanyuan Zhao, Xiaorui Wang
2024-04-17
Abstract: In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is that, in scenarios such as short videos and live streaming, speech, singing voices, and background music often overlap, which makes the audio content difficult to structure and recognize. This complexity degrades subsequent Automatic Speech Recognition (ASR) and music understanding applications. Specifically, the problems include:

1. **Complexity of Audio Signals**: In these scenarios, audio usually contains multiple sources, such as speech, singing voices, background music, and sound effects. Their overlap makes speech and lyrics harder to recognize, and recognition accuracy drops significantly.
2. **Limitations of Existing Methods**:
   - **Cascade Systems**: The traditional approach is a cascade: a speech separation system first separates the speech of multiple speakers from the mixed audio, and an ASR system then recognizes the content of each track. However, the mismatch between separated audio and natural audio harms recognition performance.
   - **End-to-End Methods**: Although some end-to-end methods optimize the overall performance of the model, they cannot identify the type of each separated track (for example, they cannot tell speech from singing voice), and they suffer from a permutation problem, which may confuse the ASR model.

To solve these problems, the paper proposes JRSV, a model that jointly recognizes speech and singing voices based on Multi-Task Audio Source Separation (MTASS). The main objectives of this model are:

- **Separation and Recognition**: Separate the mixed audio into independent speech and singing voice tracks, remove the background music, and recognize the content of both tracks at the same time.
- **Improve Robustness**: Further improve the robustness of recognition through online distillation.

The authors also construct and release a benchmark dataset, the Dual-Track Speech and Singing Voice Dataset (DTSVD). The experimental results show that JRSV can significantly improve the recognition accuracy on each track of the mixed audio.
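
The permutation problem mentioned under end-to-end methods can be illustrated with a small Python sketch. It is not taken from the paper: the function names, the toy signals, and the use of mean squared error as a stand-in separation loss are all assumptions made for illustration. The sketch compares a permutation-invariant loss, which accepts the separated outputs in any order, with a fixed-track loss, which ties each output slot to one source type in the way a task-specific separator such as MTASS assigns one track to speech and one to singing voice.

```python
from itertools import permutations

import numpy as np


def mse(a, b):
    """Mean squared error between two equal-length signals (placeholder loss)."""
    return float(np.mean((a - b) ** 2))


def fixed_track_loss(estimates, references):
    """Loss when output i is always tied to source type i (0 = speech, 1 = singing)."""
    return sum(mse(e, r) for e, r in zip(estimates, references))


def permutation_invariant_loss(estimates, references):
    """Loss minimised over every output-to-reference assignment.

    Returns (loss, assignment), where assignment[i] is the index of the
    reference matched to output i.
    """
    n = len(references)
    scored = [
        (sum(mse(estimates[i], references[p[i]]) for i in range(n)), p)
        for p in permutations(range(n))
    ]
    return min(scored)


# Toy signals standing in for the two target tracks (hypothetical data).
t = np.linspace(0.0, 1.0, 100)
speech = np.sin(2 * np.pi * 3 * t)
singing = np.sign(np.sin(2 * np.pi * 5 * t))

# A separator that produced perfect outputs, but in swapped order.
swapped = [singing.copy(), speech.copy()]

pit_loss, assignment = permutation_invariant_loss(swapped, [speech, singing])
fixed_loss = fixed_track_loss(swapped, [speech, singing])
```

In this toy case the permutation-invariant loss is zero even though the outputs are swapped, so a recognizer downstream cannot know which output carries speech. The fixed-track loss penalizes the swap, which is one way to see why tying each output to a source type avoids the ambiguity.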