Abstract: Speech language models (SpeechLMs) accept speech input and produce speech output, enabling more natural human-computer interaction than text-based large language models (LLMs). Traditional approaches to developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, both of which are far less abundant than text pre-training data, so SpeechLMs have not scaled the way LLMs have. We propose a novel approach to scaling speech-text pre-training that leverages large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing the corresponding speech spans with a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training yields discrete speech tokens that preserve semantics well even at low frame rates (e.g., 12.5 Hz) while maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (including 600B tokens of synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that fine-tuning the pre-trained model on speech dialogue data produces an end-to-end spoken chatbot whose performance is competitive with existing baselines in both conversational ability and speech quality, even when operating exclusively in the speech domain.
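The interleaving step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `span_start_prob` and `max_span_len` parameters, and the placeholder `fake_text_to_tokens` function (which stands in for a real text-to-token model) are all assumptions made for illustration. The sketch only shows the general idea of replacing sampled text spans with discrete speech tokens without synthesizing audio.

```python
import random
from typing import Callable, List, Optional


def fake_text_to_tokens(words: List[str], tokens_per_word: int = 3) -> List[str]:
    """Hypothetical stand-in for a text-to-token model.

    A real model would predict discrete speech tokens from text; here we
    just emit deterministic placeholder token strings for each word.
    """
    tokens = []
    for word in words:
        base = sum(map(ord, word))
        for offset in range(tokens_per_word):
            tokens.append(f"<speech_{(base + offset) % 1024}>")
    return tokens


def interleave(
    words: List[str],
    span_start_prob: float = 0.3,  # assumed: chance that a speech span starts at a word
    max_span_len: int = 5,         # assumed: maximum span length in words
    text_to_tokens: Callable[[List[str]], List[str]] = fake_text_to_tokens,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace randomly sampled word spans with speech tokens, keep the rest as text."""
    rng = rng or random.Random(0)
    result: List[str] = []
    i = 0
    while i < len(words):
        if rng.random() < span_start_prob:
            # Sample a span and swap it for its (placeholder) speech tokens.
            span = words[i : i + rng.randint(1, max_span_len)]
            result.extend(text_to_tokens(span))
            i += len(span)
        else:
            # Keep this word as ordinary text.
            result.append(words[i])
            i += 1
    return result


if __name__ == "__main__":
    sentence = "speech language models accept speech input and produce speech output".split()
    print(" ".join(interleave(sentence)))
```

Setting `span_start_prob` to 0 leaves the text unchanged, and setting it to 1 converts every word into placeholder speech tokens; values in between give mixed text and speech sequences.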

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

More Speaking or More Speakers?

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Towards Robust Speech Representation Learning for Thousands of Languages

Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models

Towards Better Domain Adaptation for Self-supervised Models: A Case Study of Child ASR

Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Toward Leveraging Pre-Trained Self-Supervised Frontends for Automatic Singing Voice Understanding Tasks: Three Case Studies

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning for Voice Generation

Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Scaling Speech-Text Pre-training with Synthetic Interleaved Data

OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition