Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

Vahid Noroozi, Zhehuai Chen, Somshubra Majumdar, Steve Huang, Jagadeesh Balam, Boris Ginsburg
2024-06-18
Abstract: In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.
Audio and Speech Processing, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The key problem this paper addresses is **the scarcity of labeled data available for training multimodal language models (models that can process both text and speech inputs)**. Training and evaluating large language models that handle text and speech together requires many samples containing both modalities. In practice, datasets that pair speech with text labels are very limited, which restricts how far model performance can be improved. To address this, the paper proposes three methods for generating synthetic samples that expand the data available for training these models:
1. **Generate synthetic speech instruction data from text data**:
   - Use a text-to-speech (TTS) system to convert text data into speech, producing samples that contain both the text and speech modalities.
   - For example, for a question-and-answer dataset, convert the text of the questions and answers into speech with the TTS system, so that the model can learn the relationship between text and speech.
2. **Generate text from labeled speech data**:
   - Use large language models (LLMs) to generate relevant questions and answers from existing speech transcripts, forming new text-speech paired samples.
   - This method makes full use of existing real speech data, but the quality of the generated text may vary, so LLMs are also used as filters to keep only high-quality samples.
3. **Generate text from unlabeled speech data**:
   - Use an automatic speech recognition (ASR) system to generate pseudo-labels, and then use these pseudo-labels to generate text content.
   - This method can significantly expand the amount of available data, especially for low-resource languages, because it does not rely on high-quality transcripts.
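The three generation routes above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` record, the function names, and the callable interfaces (`TTS`, `ASR`, `QAGenerator`, `QAFilter`) are all hypothetical, and the `toy_*` functions are trivial stand-ins for real TTS, ASR, and LLM systems. For route 1 the sketch synthesizes only the question, which is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Sample:
    audio: bytes    # speech input, real or synthesized
    question: str   # text question paired with the audio
    answer: str     # target text answer
    method: str     # which generation route produced the sample


# Hypothetical component interfaces (assumptions, not from the paper).
TTS = Callable[[str], bytes]                    # text -> audio bytes
ASR = Callable[[bytes], str]                    # audio bytes -> transcript
QAGenerator = Callable[[str], Tuple[str, str]]  # transcript -> (question, answer)
QAFilter = Callable[[str, str, str], bool]      # (transcript, q, a) -> keep?


def from_text_qa(question: str, answer: str, tts: TTS) -> Sample:
    """Route 1: synthesize speech for a text-only Q&A pair."""
    return Sample(tts(question), question, answer, "tts")


def from_transcribed_speech(audio: bytes, transcript: str,
                            generate_qa: QAGenerator, keep: QAFilter,
                            method: str = "transcript") -> Optional[Sample]:
    """Route 2: an LLM writes a Q&A pair from the transcript, then a filter screens it."""
    question, answer = generate_qa(transcript)
    if not keep(transcript, question, answer):
        return None  # rejected by the quality filter
    return Sample(audio, question, answer, method)


def from_unlabeled_speech(audio: bytes, asr: ASR,
                          generate_qa: QAGenerator,
                          keep: QAFilter) -> Optional[Sample]:
    """Route 3: use an ASR pseudo-label in place of a reference transcript."""
    pseudo_label = asr(audio)
    return from_transcribed_speech(audio, pseudo_label, generate_qa, keep,
                                   method="pseudo_label")


# Toy stand-ins so the sketch runs without any external model.
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_generate_qa(transcript: str) -> Tuple[str, str]:
    return ("What was said?", transcript)


def toy_keep(transcript: str, question: str, answer: str) -> bool:
    # Placeholder rule: reject very short transcripts.
    return len(transcript.split()) >= 3


clip = toy_tts("the meeting starts at noon")
print(from_unlabeled_speech(clip, toy_asr, toy_generate_qa, toy_keep))
```

Route 3 reuses route 2 unchanged and swaps only the transcript source, which mirrors the description above: the pseudo-label stands in for a reference transcript, and the same filtering step applies to both.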
Through these methods, the paper aims to use synthetic data generation to improve the model's understanding of text and speech and to promote the learning of cross-modal relationships. Experimental results indicate that these methods improve the model's joint understanding of the two modalities and its generalization across tasks.