Abstract: Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal, while self-supervised text pre-training empowers the model with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts: a speech encoder, a text encoder, and a shared encoder. The model takes unsupervised speech and text data as input and applies the standard HuBERT and MLM losses to them, respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of the speech and text information. Specifically, to address the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes using alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the LibriSpeech dataset show that the proposed TESSP model achieves more than 10% improvement compared with WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, showing that it performs better than WavLM on Phoneme Recognition, Automatic Speech Recognition, and Speech Translation.
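
The two alignment-based steps named in the abstract, phoneme up-sampling and representation swapping, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the per-phoneme frame counts in `durations`, the use of plain lists in place of encoder outputs, and the per-phoneme swap probability `swap_prob` are all assumptions made for illustration; the abstract states only that both steps rely on alignment information.

```python
import random


def upsample_phonemes(phonemes, durations):
    """Repeat each phoneme by its aligned frame count (hypothetical helper).

    The result has one entry per speech frame, so the text-side sequence
    has the same length as the speech-side sequence.
    """
    if len(phonemes) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def swap_by_alignment(text_repr, speech_repr, durations, swap_prob, rng):
    """Swap whole phoneme spans of text frames for the aligned speech frames.

    Assumption: spans are chosen independently with probability swap_prob.
    The abstract does not specify how swapped positions are selected.
    """
    if len(text_repr) != len(speech_repr) or len(text_repr) != sum(durations):
        raise ValueError("representations must be frame-aligned")
    mixed = list(text_repr)
    start = 0
    for n_frames in durations:
        end = start + n_frames
        if rng.random() < swap_prob:
            # Replace this phoneme's text frames with the speech frames
            # covering the same time span.
            mixed[start:end] = speech_repr[start:end]
        start = end
    return mixed


phonemes = ["HH", "AH", "L"]
durations = [2, 3, 1]  # frames per phoneme, e.g. from a forced aligner
upsampled = upsample_phonemes(phonemes, durations)

# Stand-ins for the outputs of the private text and speech encoders.
text_repr = [f"t{i}" for i in range(len(upsampled))]
speech_repr = [f"s{i}" for i in range(len(upsampled))]
mixed = swap_by_alignment(text_repr, speech_repr, durations, 0.5, random.Random(0))
```

In this sketch the mixed sequence would then be passed to the shared encoder, so that it sees speech and text frames in the same context.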

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Efficient infusion of self-supervised representations in Automatic Speech Recognition

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Task-Agnostic Structured Pruning of Speech Representation Models

HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition

EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion

Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

TESSP: Text-Enhanced Self-Supervised Speech Pre-training

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Discriminative Speech Recognition Rescoring with Pre-trained Language Models

Improved Speech Pre-Training with Supervision-Enhanced Acoustic Unit