Abstract: In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to further improve the robustness of recognition. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.

What problem does this paper attempt to address?

The paper addresses the following problem: in short videos and live streams, speech, singing voices, and background music often overlap, which makes the audio content difficult to structure and recognize. This degrades downstream Automatic Speech Recognition (ASR) and music understanding applications. Specifically:

1. **Complexity of Audio Signals**: Audio in these scenarios usually contains multiple sources, such as speech, singing voices, background music, and sound effects. Their overlap makes speech and lyrics harder to recognize and significantly lowers accuracy.
2. **Limitations of Existing Methods**:
   - **Cascade Systems**: The traditional approach is a cascade: a speech separation system first separates the voices of multiple speakers from the mixed audio, and an ASR system then recognizes the content of each track. However, the mismatch between separated audio and natural audio hurts recognition performance.
   - **End-to-End Methods**: Some end-to-end methods optimize the overall performance of the model, but they cannot distinguish the types of the separated tracks (for example, speech versus singing voice). They also suffer from a permutation problem, which can confuse the ASR model.

To solve these problems, the paper proposes JRSV, a model that jointly recognizes speech and singing voices based on Multi-Task Audio Source Separation (MTASS). Its main objectives are:

- **Separation and Recognition**: Separate the mixed audio into independent speech and singing voice tracks, remove the background music, and recognize the content of both tracks at the same time.
- **Improve Robustness**: Further improve the robustness of recognition through online distillation.

To evaluate the method, the authors construct and release a benchmark dataset, the Dual-Track Speech and Singing Voice Dataset (DTSVD). The experimental results show that JRSV can significantly improve the recognition accuracy on each track of the mixed audio.
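The permutation problem mentioned for end-to-end methods can be illustrated with a small sketch. This is not the JRSV implementation: the function names `pit_loss` and `fixed_track_loss`, the toy signals, and the use of mean squared error as the separation loss are all assumptions made for illustration. The sketch contrasts a permutation-invariant loss, where the output order is decided per utterance, with a fixed assignment in which output 0 is always the speech track and output 1 is always the singing track, which is how a type-aware separator such as an MTASS module could plausibly avoid ambiguity.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Permutation-invariant loss (illustrative helper, not from the paper).

    Tries every assignment of outputs to references and keeps the cheapest.
    The winning order can differ per utterance, so a downstream recognizer
    cannot know which output holds which source type.
    """
    best_loss, best_perm = None, None
    for perm in permutations(range(len(references))):
        loss = sum(
            mse(estimates[i], references[p]) for i, p in enumerate(perm)
        ) / len(perm)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def fixed_track_loss(estimates, references):
    """Fixed-assignment loss: output 0 is scored against reference 0
    (speech) and output 1 against reference 1 (singing), with no reordering."""
    return sum(mse(e, r) for e, r in zip(estimates, references)) / len(references)


# Toy reference tracks (assumed values, for illustration only).
speech = [1.0, 0.0, 1.0, 0.0]
singing = [0.0, 1.0, 0.0, 1.0]
references = [speech, singing]

# A separator whose outputs come back in swapped order.
swapped = [[0.1, 0.9, 0.1, 0.9], [0.9, 0.1, 0.9, 0.1]]

pit_value, pit_order = pit_loss(swapped, references)
fixed_value = fixed_track_loss(swapped, references)

print("PIT loss:", pit_value, "chosen order:", pit_order)
print("Fixed-track loss:", fixed_value)
```

The permutation-invariant loss accepts the swapped outputs because it silently reorders them, whereas the fixed-track loss penalizes the swap heavily. Under this assumption, training against fixed track types pushes the separator to always emit speech and singing in known positions, so each recognizer branch receives the source type it expects.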

Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR

Mixture Encoder for Joint Speech Separation and Recognition

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

DJCM: A Deep Joint Cascade Model for Singing Voice Separation and Vocal Pitch Estimation

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control

End-to-end Music-mixed Speech Recognition

RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues

Joint Speech-Text Embeddings for Multitask Speech Processing

Collaborative Joint Training With Multitask Recurrent Model for Speech and Speaker Recognition

Multi-stage music separation network with dual-branch attention and hybrid convolution

Audio-visual multi-channel speech separation, dereverberation and recognition

A Semi-Supervised Complementary Joint Training Approach for Low-Resource Speech Recognition

Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Multi-Band Multi-Resolution Fully Convolutional Neural Networks for Singing Voice Separation

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition