Abstract: In this paper, we propose MMSpeech, a novel multi-modal multi-task encoder-decoder pre-training framework for Mandarin automatic speech recognition (ASR) that employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between the speech and text modalities, which is especially pronounced for Mandarin. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system in which characters and sounds are not tightly mapped to one another. We therefore introduce the phoneme modality into pre-training, which helps capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework comprising five self-supervised and supervised tasks on speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks that use unlabeled speech and text data, where the speech-pseudo-code pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce a self-supervised masked speech prediction (MSP) task and a supervised phoneme prediction (PP) task, which teach the encoder to map speech into phonemes. In addition, we add the downstream supervised speech-to-text (S2T) task directly into the pre-training process, which further improves pre-training performance and achieves better recognition results even without fine-tuning. Experiments on AISHELL-1 show that the proposed method achieves state-of-the-art performance, with a relative improvement of more than 40% over other pre-training methods.
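As a rough illustration of the five-task setup described in the abstract, the sketch below combines per-task losses into a single training objective. The abstract does not say how the losses are combined; the weighted sum, the equal default weights, and the function name `combine_task_losses` are assumptions made for this example, not the authors' method.

```python
from typing import Dict, Optional

# Task abbreviations as named in the abstract.
TASKS = ("S2C", "P2T", "MSP", "PP", "S2T")


def combine_task_losses(
    losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return a weighted sum of the five per-task losses.

    Assumption: a plain weighted sum with equal weights by default.
    The abstract does not specify the actual combination scheme.
    """
    if weights is None:
        weights = {name: 1.0 for name in TASKS}  # hypothetical default
    missing = [name for name in TASKS if name not in losses]
    if missing:
        raise KeyError(f"missing losses for tasks: {missing}")
    return sum(weights[name] * losses[name] for name in TASKS)


# Example with made-up loss values for one training step.
step_losses = {"S2C": 1.0, "P2T": 2.0, "MSP": 3.0, "PP": 4.0, "S2T": 5.0}
total = combine_task_losses(step_losses)
```

In a real training loop each entry would be a differentiable tensor rather than a float, and the weights would be tuned or scheduled per task.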

Task-Related Pretraining with Whole Word Masking for Chinese Coherence Evaluation

Text Coherence Analysis Based on Deep Neural Network

Iterative Task-adaptive Pretraining for Unsupervised Word Alignment

Pre-Training with Whole Word Masking for Chinese BERT

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

Evaluating Text Coherence at Sentence and Paragraph Levels

Toward High Quality Facial Representation Learning

Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention

Train No Evil: Selective Masking for Task-Guided Pre-Training

Enhancing Coherence of Extractive Summarization with Multitask Learning

Pre-training Language Models for Comparative Reasoning

Learning to Rank Semantic Coherence for Topic Segmentation

Modeling Coherence for Discourse Neural Machine Translation

DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting

Understanding Chinese Moral Stories with Further Pre-Training

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

A Novel Computational and Modeling Foundation for Automatic Coherence Assessment

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Different Strokes for Different Folks: Investigating Appropriate Further Pre-training Approaches for Diverse Dialogue Tasks

Segment, Mask, and Predict: Augmenting Chinese Word Segmentation with Self-Supervision