Abstract: In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between the speech and text modalities, which is especially pronounced for Mandarin. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system in which characters and sounds are not tightly mapped to one another. We therefore propose introducing the phoneme modality into pre-training, which helps capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework comprising five self-supervised and supervised tasks on speech and text data. For end-to-end pre-training, we introduce the self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks, which use unlabeled speech and text data; the resulting speech-pseudo-codes pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce the self-supervised masked speech prediction (MSP) task and the supervised phoneme prediction (PP) task, the latter of which learns to map speech into phonemes. In addition, we add the downstream supervised speech-to-text (S2T) task directly into the pre-training process, which further improves pre-training performance and achieves better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a relative improvement of more than 40% compared with other pre-training methods.
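
The five tasks named in the abstract can be summarized as a small table in code. The sketch below is illustrative only: the abstract does not say how the task losses are combined, so the weighted sum in `combine_losses`, the equal default weights, and the names `Task`, `TASKS`, and `combine_losses` are all assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str         # abbreviation used in the abstract
    source: str       # model input
    target: str       # prediction target
    supervised: bool  # True if the task needs labeled speech-text data


# Task inventory as described in the abstract.
TASKS = [
    Task("S2C", "unlabeled speech", "pseudo-codes", False),
    Task("P2T", "phonemes derived from unlabeled text", "text", False),
    Task("MSP", "masked speech", "masked speech content", False),
    Task("PP", "labeled speech", "phonemes", True),
    Task("S2T", "labeled speech", "text", True),
]


def combine_losses(losses, weights=None):
    """Weighted sum of per-task losses.

    ASSUMPTION: a weighted sum with equal default weights is a common
    multi-task choice; the abstract does not specify the actual scheme.
    """
    if weights is None:
        weights = {name: 1.0 for name in losses}
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


if __name__ == "__main__":
    # Hypothetical per-task loss values for one training step.
    step_losses = {"S2C": 2.0, "P2T": 1.5, "MSP": 0.5, "PP": 1.0, "S2T": 3.0}
    print(combine_losses(step_losses))
```

In practice the per-task values would come from the model's forward passes on batches of the corresponding data type; here they are placeholder numbers.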

Improving Transformer-Based Speech Recognition with Unsupervised Pre-Training and Multi-Task Semantic Knowledge Learning

A Further Study of Unsupervised Pre-training for Transformer Based Speech Recognition

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding

Research Status and Prospect of Transformer in Speech Recognition

End-to-End Multi-speaker Speech Recognition with Transformer

The Speechtransformer for Large-scale Mandarin Chinese Speech Recognition

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Multitask Learning and Joint Optimization for Transformer-RNN-Transducer Speech Recognition

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

Transformer-based End-to-End Speech Recognition with Residual Gaussian-based Self-Attention

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Simplified Self-Attention for Transformer-Based End-to-End Speech Recognition

Transformer with Bidirectional Decoder for Speech Recognition

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Semantic Mask for Transformer Based End-to-End Speech Recognition

Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification

Improving Mandarin Speech Recognition with Block-augmented Transformer

Improving End-to-End Single-Channel Multi-Talker Speech Recognition