On decoder-only architecture for speech-to-text and large language model integration

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

2023-10-02

Abstract: Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.

Audio and Speech Processing, Computation and Language, Sound

What problem does this paper attempt to address?

The main problem this paper addresses is the seamless integration of speech signals into large language models (LLMs), with a particular focus on decoder-only architectures for speech processing tasks. The researchers propose Speech-LLaMA, a method that incorporates acoustic information into text-based large language models and thereby improves performance on speech-to-text tasks.

The key challenges mentioned in the paper include:

1. **Modality Alignment**: Speech signals are usually much longer than the corresponding text sequences, so aligning the two modalities inside a pre-trained LLM is difficult.
2. **Cost-Effectiveness**: Training LLMs is expensive, so the integration should add as little cost as possible while maintaining high performance.
3. **Potential of Decoder-Only Architecture**: Given the success of LLMs, the researchers want to explore the untapped potential of the decoder-only architecture as the foundational network for speech-to-text processing.

To address these issues, the researchers designed a simple yet effective architecture that feeds acoustic information into a large language model through a CTC-based acoustic feature compressor and an audio encoder, so that the LM generates the corresponding text conditioned on the prompt. Experimental results show that this method significantly outperforms strong baseline models on multilingual speech-to-text translation tasks. A decoder-only model trained from scratch achieves comparable performance with approximately 40% fewer parameters, which supports the potential of decoder-only models for general speech-to-text modeling.
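The summary does not spell out how the compressor shortens the acoustic sequence. A minimal sketch of one plausible approach is shown below, assuming the compressor uses per-frame CTC label predictions either to drop blank frames or to average runs of frames that share a label. The function name `ctc_compress`, the blank index `0`, and the two modes are illustrative assumptions, not the authors' implementation.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def argmax(row):
    """Index of the largest score in a posterior row (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def ctc_compress(features, posteriors, mode="average", blank=BLANK):
    """Shorten frame-level features using per-frame CTC predictions.

    features:   list of T feature vectors (lists of floats)
    posteriors: list of T rows of CTC scores, one score per label
    mode:       "remove"  drops frames predicted as blank;
                "average" drops blank frames and replaces each run of
                consecutive frames with the same label by its mean vector
    """
    labels = [argmax(p) for p in posteriors]
    if mode == "remove":
        return [f for f, lab in zip(features, labels) if lab != blank]

    compressed = []
    for label, run in groupby(zip(labels, features), key=lambda pair: pair[0]):
        if label == blank:
            continue  # blank runs carry no token and are skipped
        frames = [f for _, f in run]
        # Element-wise mean over all frames in this run
        compressed.append([sum(col) / len(frames) for col in zip(*frames)])
    return compressed


if __name__ == "__main__":
    # Six 2-dimensional frames whose predicted labels are 1, 1, blank, 2, 2, 1
    feats = [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0], [5.0, 5.0], [7.0, 7.0], [2.0, 2.0]]
    one, blk, two = [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]
    post = [one, one, blk, two, two, one]
    print(len(feats), "->", len(ctc_compress(feats, post, "average")))
```

In the toy input, six frames shrink to three vectors in `average` mode and to five in `remove` mode. This illustrates how CTC predictions can bring the audio sequence length closer to the text length before the result is passed to the language model.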

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Using Large Language Model for End-to-End Chinese ASR and NER

Connecting Speech Encoder and Large Language Model for ASR

Prompting Large Language Models with Speech Recognition Abilities

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Faster Speech-LLaMA Inference with Multi-token Prediction

A Survey on Speech Large Language Models

Decoder-only Architecture for Streaming End-to-end Speech Recognition

Efficient Streaming LLM for Speech Recognition

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Large Language Model Enabled Semantic Communication Systems

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Integrated Method of Deep Learning and Large Language Model in Speech Recognition

Transferable speech-to-text large language model alignment module

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

On the Uses of Large Language Models to Design End-to-End Learning Semantic Communication