CAMP: A Unified Data Solution for Mandarin Speech Recognition Tasks

Zeping Min,Qian Ge,Li Zhong
2023-01-01
Abstract:Speech recognition, the transformation of spoken language into written text, is becoming increasingly vital across a broad range of applications. Despite the advancements in end-to-end Neural Network (NN) based speech recognition systems, the requirement for large volumes of annotated audio data tailored to specific scenarios remains a significant challenge. To address this, we introduce a novel approach, the Character Audio Mix-up (CAMP), which synthesizes scenario-specific audio data for Mandarin at a significantly reduced cost and effort. This method concatenates the audio segments of each character’s Pinyin in the text, obtained through force alignment on an existing annotated dataset, to synthesize the audio. These synthesized audios are then used to train the Automatic Speech Recognition (ASR) models. Experiments conducted on the AISHELL-3, and AIDATATANG datasets validate the effectiveness of CAMP, with ASR models trained on CAMP synthesized data performing relatively well compared to those trained with actual data from these datasets. Further, our ablation study reveals that while synthesized audio data can significantly reduce the need for real annotated audio specific to each scenario, it cannot entirely replace real audio. Thus, the importance of real annotated audio data in specific application scenarios is emphasized.
What problem does this paper attempt to address?