Abstract: In this paper, we propose MMSpeech, a novel multi-modal multi-task encoder-decoder pre-training framework for Mandarin automatic speech recognition (ASR) that employs both unlabeled speech and unlabeled text data. The main difficulty in speech-text joint pre-training comes from the significant difference between the speech and text modalities, and this difference is especially large for Mandarin. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system in which characters and sounds are not tightly mapped to one another. We therefore introduce the phoneme modality into pre-training, which helps capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework with five self-supervised and supervised tasks over speech and text data. For end-to-end pre-training, we introduce the self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks, which use unlabeled speech and text data; the resulting speech-pseudo-codes pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce the self-supervised masked speech prediction (MSP) task and the supervised phoneme prediction (PP) task, which learn to map speech into phonemes. In addition, we directly add the downstream supervised speech-to-text (S2T) task to the pre-training process, which further improves pre-training performance and achieves better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a relative improvement of more than 40% over other pre-training methods.
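
The abstract names five pre-training tasks (S2C, P2T, MSP, PP, S2T) but does not say how their losses are combined. As a minimal sketch, assuming a plain weighted sum with equal default weights (a common multi-task choice, not confirmed by the abstract), the joint objective and the relative-improvement figure could be computed as follows. The function names `combine_losses` and `relative_improvement`, the weights, and the numbers in the usage lines are all hypothetical.

```python
# The five pre-training tasks listed in the abstract.
TASKS = ("S2C", "P2T", "MSP", "PP", "S2T")


def combine_losses(losses, weights=None):
    """Return a weighted sum of per-task losses.

    Assumption: the tasks are combined by a weighted sum. The abstract
    does not state the combination rule or any weights.
    """
    if weights is None:
        # Hypothetical default: every task contributes equally.
        weights = {task: 1.0 for task in TASKS}
    missing = [task for task in TASKS if task not in losses]
    if missing:
        raise KeyError(f"missing loss for task(s): {missing}")
    return sum(weights[task] * losses[task] for task in TASKS)


def relative_improvement(baseline_error, new_error):
    """Relative error reduction, e.g. 0.4 means a 40% relative improvement."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Illustrative values only; these are not results from the paper.
example_losses = {"S2C": 1.0, "P2T": 1.0, "MSP": 1.0, "PP": 1.0, "S2T": 1.0}
total = combine_losses(example_losses)
gain = relative_improvement(5.0, 3.0)
```

Here "relative improvement" is read as the reduction in error rate divided by the baseline error rate, so an error rate falling from 5.0 to 3.0 would count as a 40% relative improvement.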

CAMP: A Unified Data Solution for Mandarin Speech Recognition Tasks

10 hours data is all you need

Adaptive data augmentation for mandarin automatic speech recognition

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

MAC: A unified framework boosting low resource automatic speech recognition

AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

An efficient text augmentation approach for contextualized Mandarin speech recognition

Boosting Character-based Mandarin ASR via Chinese Pinyin Representation

Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

AISHELL-4 - An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario.

Robust Audio-Visual Mandarin Speech Recognition Based on Adaptive Decision Fusion and Tone Features

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

MASS: Multi-task anthropomorphic speech synthesis framework