Abstract: Text to speech (TTS) and automatic speech recognition (ASR) are dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that uses only a few hundred paired samples plus extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop language modeling capability in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text $y$ into speech $\hat{x}$ and the ASR model uses the transformed pair $(\hat{x},y)$ for training, and vice versa, to boost the accuracy of both tasks; (3) bidirectional sequence modeling, which addresses error propagation, especially in long speech and text sequences, when training with few paired data; (4) a unified model structure, which combines all the above components for TTS and ASR based on the Transformer model. Our method achieves a word-level intelligible rate of 99.84% and a MOS of 2.68 for TTS, and a PER of 11.7% for ASR on the LJSpeech dataset, using only 200 paired speech and text samples (about 20 minutes of audio) together with extra unpaired speech and text data.
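The data flow of component (2), dual transformation, can be illustrated with a minimal sketch. It assumes the TTS and ASR models are available as plain callables; the function name `dual_transformation_pairs`, the type aliases, and the toy stand-in models are hypothetical and are not taken from the paper, which trains Transformer models on these pseudo pairs.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in types: real systems would use mel-spectrogram
# frames for speech and phoneme or character sequences for text.
Speech = List[float]
Text = str


def dual_transformation_pairs(
    unpaired_text: Sequence[Text],
    unpaired_speech: Sequence[Speech],
    tts: Callable[[Text], Speech],
    asr: Callable[[Speech], Text],
) -> Tuple[List[Tuple[Speech, Text]], List[Tuple[Text, Speech]]]:
    """Build pseudo-paired training data from unpaired corpora."""
    # Text y -> synthesized speech x_hat; the pair (x_hat, y) trains ASR.
    asr_pairs = [(tts(y), y) for y in unpaired_text]
    # Speech x -> transcribed text y_hat; the pair (y_hat, x) trains TTS.
    tts_pairs = [(asr(x), x) for x in unpaired_speech]
    return asr_pairs, tts_pairs


# Toy models (assumptions for illustration only): "speech" is just the
# list of character codes, so the two mappings are exact inverses.
def toy_tts(y: Text) -> Speech:
    return [float(ord(c)) for c in y]


def toy_asr(x: Speech) -> Text:
    return "".join(chr(int(v)) for v in x)


asr_pairs, tts_pairs = dual_transformation_pairs(
    ["hi"], [[111.0, 107.0]], toy_tts, toy_asr
)
```

In an actual training loop the two models would be updated on these pairs and the pairs regenerated as the models improve; the sketch covers only a single round of pair construction.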

Almost Unsupervised Text to Speech and Automatic Speech Recognition