On decoder-only architecture for speech-to-text and large language model integration

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
2023-10-02
Abstract: Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. In addition, we further probe the decoder-only architecture for speech-to-text tasks by training a smaller scale randomly initialized speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper addresses how to integrate speech signals into large language models (LLMs) and, more broadly, how well a decoder-only architecture works for speech processing tasks. The researchers propose Speech-LLaMA, a method that incorporates acoustic information into a text-based LLM to improve performance on speech-to-text tasks.

The key challenges mentioned in the paper include:

1. **Modality Alignment**: Speech sequences are usually much longer than the corresponding text sequences, so aligning the two modalities inside a pre-trained LLM is difficult.
2. **Cost-Effectiveness**: Training LLMs is expensive, so the integration should add as little cost as possible while maintaining high performance.
3. **Potential of Decoder-Only Architecture**: Given the success of LLMs, the researchers explore the largely untapped potential of the decoder-only architecture as the foundational network for speech-to-text processing.

To address these issues, the researchers designed a simple architecture that pairs the LLM with a CTC-based acoustic feature compressor and a simple audio encoder. The compressor shortens the acoustic sequence, and the audio encoder maps the compressed features into the continuous semantic space of the LLM, so that the LLM can generate the corresponding text conditioned on the prompt. Experimental results show that this method significantly outperforms strong baseline models on multilingual speech-to-text translation tasks. In addition, a smaller decoder-only model trained from scratch on speech-text paired data alone achieves comparable performance with approximately 40% fewer parameters, which supports the potential of decoder-only models for general speech-to-text modeling.
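
To make the modality-alignment point concrete, the following is a minimal sketch of one way a CTC-based compressor could shorten an acoustic sequence before it reaches the LLM. It is not taken from the paper's code: the function name `ctc_compress`, the greedy (argmax) labelling, the choice to drop blank frames and average runs of identical labels, and `blank_id=0` are all assumptions made for illustration.

```python
import numpy as np


def ctc_compress(features, ctc_posteriors, blank_id=0):
    """Shorten a frame sequence using greedy CTC labels (illustrative only).

    features:       (T, D) array of acoustic frames.
    ctc_posteriors: (T, V) array of per-frame CTC label scores.
    blank_id:       index of the CTC blank label (assumed to be 0 here).

    Returns a (T', D) array with T' <= T: blank frames are dropped and each
    run of consecutive frames sharing a non-blank label is averaged into one.
    """
    labels = ctc_posteriors.argmax(axis=-1)  # greedy per-frame label
    compressed = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run when the label changes or the input ends.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank_id:
                compressed.append(features[start:t].mean(axis=0))
            start = t
    if not compressed:
        # Every frame was blank (or the input was empty).
        return np.zeros((0, features.shape[1]), dtype=features.dtype)
    return np.stack(compressed)


# Toy input: 6 frames of dimension 2, greedy labels [blank, 1, 1, blank, 2, 2].
frames = np.arange(12, dtype=float).reshape(6, 2)
posteriors = np.eye(3)[[0, 1, 1, 0, 2, 2]]
short = ctc_compress(frames, posteriors)
print(short.shape)  # (2, 2): six frames reduced to two
```

In a setup like the one summarized above, the shortened sequence would then pass through the audio encoder so that it lies in the LLM's embedding space; that step is omitted here because its details are not given in this summary.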