Instruction-Following Speech Recognition

Cheng-I Jeff Lai, Zhiyun Lu, Liangliang Cao, Ruoming Pang
2023-09-18
Abstract: Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.
Computation and Language, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the inflexibility of conventional end-to-end automatic speech recognition (ASR) models, which focus on exact transcription and handle nuanced user interactions poorly. Large language models (LLMs) have made more natural, text-prompt-based interaction possible in speech processing, but the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study the question from the data perspective, the authors propose instruction-following speech recognition: a Listen-Attend-Spell model is trained to understand and execute free-form text instructions, so that it can perform multiple speech recognition tasks without relying on predefined command sets. The main contributions of the paper are:

1. **Instruction-following training**: The model learns to understand and execute diverse free-form text instructions, enabling speech recognition tasks that range from transcript manipulation to summarization.
2. **No pre-trained modules required**: The model is trained from scratch on the Librispeech dataset alone and can interpret and execute simple instructions without relying on LLMs or pre-trained speech modules.
3. **Selective transcription**: The model can transcribe selectively in response to instructions such as "transcribe first half and then turn off listening," which provides an additional layer of privacy and safety compared to existing LLMs.
4. **Skill diversification**: The model acquires a range of ASR-related skills, including speech transcription, ignoring speech, word replacement, transcript manipulation, summarization, and keyword extraction.

Together, these results demonstrate the significant potential of instruction-following training for advancing speech foundation models.
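To make the idea of instruction-following training data concrete, the following is a minimal Python sketch of how a plain transcript might be turned into (instruction, target) pairs for a few of the skills listed above. This page does not describe the authors' actual data pipeline, so everything here is an assumption made for illustration: the skill names, the instruction templates (apart from the one quoted in the abstract), and the functions `build_target` and `make_example` are all hypothetical.

```python
# Illustrative sketch only: the skill names, templates, and helpers below are
# hypothetical and are NOT taken from the paper.
INSTRUCTIONS = {
    "transcribe": "transcribe the speech",
    "ignore": "ignore the speech",
    # This template is the example instruction quoted in the abstract.
    "first_half": "transcribe first half and then turn off listening",
    "replace": "replace {old} with {new} in the transcript",
}


def build_target(transcript: str, skill: str, old: str = "", new: str = "") -> str:
    """Derive the text the model should emit for a given skill."""
    words = transcript.split()
    if skill == "transcribe":
        return transcript
    if skill == "ignore":
        # The model is expected to output nothing.
        return ""
    if skill == "first_half":
        # Keep only the first half of the words (rounded down).
        return " ".join(words[: len(words) // 2])
    if skill == "replace":
        # Swap every exact occurrence of `old` for `new`.
        return " ".join(new if w == old else w for w in words)
    raise ValueError(f"unknown skill: {skill}")


def make_example(transcript: str, skill: str, old: str = "", new: str = "") -> dict:
    """Pair a free-form text instruction with its target output."""
    instruction = INSTRUCTIONS[skill].format(old=old, new=new)
    target = build_target(transcript, skill, old=old, new=new)
    return {"instruction": instruction, "target": target}


if __name__ == "__main__":
    example = make_example("the cat sat on the mat", "first_half")
    print(example["instruction"])
    print(example["target"])
```

In a setup like this, the audio of the utterance and the instruction text would be the model inputs, and `target` would be the supervised output. Summarization and keyword extraction are left out because their targets cannot be derived from the transcript by a simple rule.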