SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg
2023-10-14
Abstract: We present a novel Speech Augmented Language Model (SALM) with *multitask* and *in-context* learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, *speech supervised in-context training* is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
Computation and Language, Human-Computer Interaction, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Constructing a unified multi-task speech-language model**: The paper proposes a new Speech-Augmented Language Model (SALM) that uses the multi-task capabilities of large language models (LLMs) to handle multiple speech tasks in a single model. These tasks include Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). SALM not only achieves performance comparable to task-specific Conformer baselines, but can also adapt to new tasks from cues given in the context without additional training, that is, zero-shot in-context learning.
2. **Enhancing the in-context learning ability of speech models**: Beyond building a unified multi-task model, the paper explores how to use the in-context learning ability of LLMs to improve speech models. Specifically, the authors evaluate and improve in-context learning in speech understanding tasks by introducing a keyword-boosting task. In addition, the paper proposes a speech supervised in-context training method to further strengthen the model's in-context learning ability across tasks.

### Main contributions

- **Proposing the SALM model**: SALM is a unified multi-task speech-language model framework that combines a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech inputs and the associated task instructions. Experimental results show that SALM achieves performance comparable to the Conformer baselines on ASR and AST, and the implementation is open-sourced.
- **Endowing speech-to-text models with zero-shot in-context learning ability for the first time**: The keyword-boosting tasks for ASR and AST show that SALM is capable of zero-shot in-context learning. This is achieved by providing keywords as context cues, without any additional parameter updates to the model.
- **Proposing the speech supervised in-context training method**: To further strengthen the model's in-context learning ability, the paper proposes a new training method, speech supervised in-context training (Speech ICT). The method randomly samples words from the training data and supplies them as context cues, which teaches the model to make better use of context information and leads to better generalization across tasks and datasets.

### Experimental results

- **Multi-task modeling and in-context learning**: SALM performs comparably to the task-specific baselines on ASR and AST, and better in some cases. In the keyword-boosting task in particular, SALM shows a significant in-context learning ability and uses context cues to improve keyword recognition accuracy.
- **Zero-shot in-context learning**: The keyword-boosting task verifies SALM's zero-shot in-context learning ability. Experimental results show that SALM can pick up keywords from the context and improve their recognition accuracy without additional training.
- **Speech supervised in-context training**: Introducing Speech ICT further enhances the model's in-context learning ability. Experiments show that this training method not only improves performance on known words, but also generalizes to unseen words and datasets.

In conclusion, the paper offers new solutions for speech understanding and generation tasks by constructing a unified multi-task speech-language model framework and by enhancing the in-context learning ability of speech models.
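
The Speech ICT idea described above (sampling words from the training data and supplying them as context cues) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, and the `max_words` parameter are hypothetical, and the summary does not specify the actual prompt template or sampling strategy used in the paper.

```python
import random


def sample_context_words(transcript, max_words=3, rng=None):
    """Pick up to `max_words` distinct words from a transcript as context cues.

    Hypothetical helper: the summary only says words are randomly sampled
    from the training data; the sampling details here are assumptions.
    """
    rng = rng or random.Random()
    # Sort the unique words so sampling is reproducible for a seeded rng.
    words = sorted(set(transcript.lower().split()))
    k = min(max_words, len(words))
    return rng.sample(words, k)


def build_prompt(task_instruction, keywords):
    """Append keyword cues to a task instruction (assumed prompt wording)."""
    if not keywords:
        return task_instruction
    return f"{task_instruction} Pay attention to these words: {', '.join(keywords)}."


# Training-time use: cues are drawn from the reference transcript itself.
rng = random.Random(0)
transcript = "the conformer encoder feeds the adapter"
cues = sample_context_words(transcript, max_words=2, rng=rng)
prompt = build_prompt("Transcribe the audio.", cues)
```

At inference time the same `build_prompt` helper would take user-supplied keywords instead of sampled ones, which corresponds to the keyword-boosting setup: the model receives the cues only through its text prompt, with no parameter updates.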