Abstract: We present a novel Speech Augmented Language Model (SALM) with *multitask* and *in-context* learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, *speech supervised in-context training* is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
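To make the "frozen LLM plus LoRA layers" idea concrete, here is a minimal sketch of the standard LoRA formulation, in which a frozen weight matrix W is combined with a trainable low-rank update scaled by alpha / r. It is not taken from the SALM or NeMo code: the class name `LoRALinear`, the helper `matvec`, and the dimensions are illustrative assumptions, and plain Python lists stand in for tensors.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen: never updated during adaptation
        # A starts with small random values and B with zeros, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]
```

Only `A` and `B` would be trained in such a setup, which is why the text LLM itself can stay frozen while still adapting to speech input.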

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Constructing a unified multitask speech-language model**: The paper proposes a new Speech-Augmented Language Model (SALM) that uses the multitask capabilities of large language models (LLMs) to handle multiple speech tasks in a single model. These tasks include Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). SALM achieves performance comparable to task-specific Conformer baselines and can also adapt to new tasks from examples given in the context, without additional training (zero-shot in-context learning).
2. **Enhancing the in-context learning ability of speech models**: Beyond the unified multitask model, the paper explores how to use the in-context learning ability of LLMs to improve speech models. Specifically, the authors introduce a keyword-boosting task to evaluate and improve in-context learning in speech understanding. The paper also proposes a speech supervised in-context training method to further strengthen this ability across tasks.

### Main contributions

- **Proposing the SALM model**: SALM is a unified multitask speech-language model framework that combines a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to handle speech inputs and the associated task instructions. Experimental results show that SALM performs on par with the Conformer baselines for ASR and AST, and the implementation is open-sourced.
- **Endowing speech-to-text models with zero-shot in-context learning ability for the first time**: The keyword-boosting tasks for ASR and AST show that SALM is capable of zero-shot in-context learning. Keywords are supplied as context cues, with no additional parameter updates to the model.
- **Proposing the speech supervised in-context training method**: To further strengthen in-context learning, the paper proposes a new training method, speech supervised in-context training (Speech ICT). The method randomly samples words from the training data and uses them as context cues, which helps the model make better use of context information and generalize better across tasks and datasets.

### Experimental results

- **Multitask modeling and in-context learning**: SALM's performance on ASR and AST is comparable to that of the task-specific baselines, and better in some cases. In the keyword-boosting task in particular, SALM shows significant in-context learning ability and improves keyword recognition accuracy through context cues.
- **Zero-shot in-context learning**: The keyword-boosting task verifies SALM's zero-shot in-context learning ability. The model picks up keywords from the context and recognizes them more accurately without additional training.
- **Speech supervised in-context training**: Introducing Speech ICT further enhances the model's in-context learning ability. Experiments show that this training method improves performance on known words and also generalizes to unseen words and datasets.

In conclusion, the paper offers new solutions for speech understanding and generation tasks by constructing a unified multitask speech-language model framework and enhancing the in-context learning ability of speech models.
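The keyword-boosting and Speech ICT ideas described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names `sample_context_words` and `build_prompt`, the prompt template, and the choice to sample from a single utterance's transcript are all hypothetical. The summary only states that words are randomly sampled from the training data and used as context cues.

```python
import random


def sample_context_words(transcript, max_words=3, rng=None):
    """Randomly pick up to max_words distinct words from a transcript.

    Assumption: words come from one reference transcript; the exact
    sampling scheme is not specified in the summary.
    """
    rng = rng or random.Random()
    vocabulary = sorted(set(transcript.lower().split()))
    k = min(max_words, len(vocabulary))
    return rng.sample(vocabulary, k)


def build_prompt(instruction, keywords):
    """Attach keywords to a task instruction as a context cue.

    The template wording is hypothetical, chosen only for illustration.
    """
    if not keywords:
        return instruction
    return f"{instruction} Keywords that may appear: {', '.join(keywords)}."


# Training-style usage: sample cue words from a reference transcript.
rng = random.Random(0)
cues = sample_context_words("the quick brown fox jumps", max_words=2, rng=rng)
training_prompt = build_prompt("Transcribe the audio.", cues)

# Inference-style usage: the user supplies the keywords directly.
inference_prompt = build_prompt("Transcribe the audio.", ["NeMo", "Conformer"])
```

At inference time the same prompt shape carries user-supplied keywords, so no parameter updates are needed to boost them.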

SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation
