Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization

Yukun Ma, Chong Zhang, Qian Chen, Wen Wang, Bin Ma
DOI: https://doi.org/10.1109/lsp.2024.3419719
2024-07-06
IEEE Signal Processing Letters
Abstract: Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. Recently, discretized speech features have gained attention as an efficient and compatible alternative to continuous features for LLMs, mainly because of their reduced storage requirements and their better alignment with the LLM's input space. However, the typical practice of freezing the speech encoder during training makes it harder to bridge the modality gap between speech and text. To address this, we propose a mixed-scale re-tokenization layer that integrates multiple granularities of discretized speech features directly within the LLM's input module. Our experimental results demonstrate that the proposed method effectively improves ASR performance in the setting of continuous learning of an LLM, highlighting the importance of a meticulously designed input module for integrating discretized speech features with an LLM.
engineering, electrical & electronic