Abstract: In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.

What problem does this paper attempt to address?

The key problem this paper addresses is **the scarcity of labeled data available for training multimodal language models (models capable of processing text and speech inputs)**. Training and evaluating large language models that accept both text and speech inputs requires a large number of samples containing both modalities. In practice, datasets that contain both text and speech are very limited, which restricts model performance. To address this, the paper proposes three methods for generating synthetic samples that expand the data available for training these models:

1. **Generate synthetic speech instruction data based on text data**:
   - Use a text-to-speech (TTS) system to convert text data into speech, producing samples that contain both text and speech modalities.
   - For example, for a question-answering dataset, convert the text of the questions and answers into speech with the TTS system, so that the model can learn the relationship between text and speech.
2. **Generate text from labeled speech data**:
   - Use large language models (LLMs) to generate relevant questions and answers from existing speech transcripts, forming new text-speech paired samples.
   - This method makes full use of existing real speech data, but the quality of the generated text may vary, so LLMs are also used as filters to select high-quality samples.
3. **Generate text from unlabeled speech data**:
   - Use an automatic speech recognition (ASR) system to generate pseudo-labels, and then use these pseudo-labels to generate text content.
   - This method can significantly expand the amount of available data, especially for low-resource languages, because it does not rely on high-quality transcripts.

With these methods, the paper aims to improve the model's understanding of text and speech through synthetic data generation and to promote the learning of cross-modal relationships. Experimental results show that the methods effectively improve the model's joint understanding of the two modalities and its generalization across tasks.
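The three generation routes can be sketched as a small Python pipeline. This is only an illustration under stated assumptions, not the paper's implementation: the `Sample` record, the function names, and the callable types (`TTS`, `ASR`, `QAGen`, `QAFilter`) are all hypothetical placeholders for whatever TTS, ASR, and LLM components are actually used, and synthesizing only the question audio in the first route is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical component interfaces (assumptions, not from the paper):
TTS = Callable[[str], bytes]                  # text -> synthesized audio
ASR = Callable[[bytes], str]                  # audio -> pseudo-transcript
QAGen = Callable[[str], Tuple[str, str]]      # transcript -> (question, answer), e.g. an LLM
QAFilter = Callable[[str, str, str], bool]    # (transcript, question, answer) -> keep?


@dataclass
class Sample:
    question: str
    answer: str
    audio: bytes   # speech that accompanies the text
    source: str    # which route produced the sample


def from_text_qa(pairs: Iterable[Tuple[str, str]], tts: TTS) -> List[Sample]:
    """Route 1: synthesize speech for existing text question-answer pairs."""
    return [Sample(q, a, tts(q), "tts") for q, a in pairs]


def from_transcribed_speech(
    items: Iterable[Tuple[bytes, str]],
    generate_qa: QAGen,
    keep: QAFilter,
    source: str = "labeled",
) -> List[Sample]:
    """Route 2: generate Q&A text from transcripts of real speech, then filter."""
    samples = []
    for audio, transcript in items:
        question, answer = generate_qa(transcript)
        # An LLM-based filter would go here; low-quality generations are dropped.
        if keep(transcript, question, answer):
            samples.append(Sample(question, answer, audio, source))
    return samples


def from_unlabeled_speech(
    audios: Iterable[bytes], asr: ASR, generate_qa: QAGen, keep: QAFilter
) -> List[Sample]:
    """Route 3: pseudo-label unlabeled speech with ASR, then reuse route 2."""
    pseudo_labeled = [(audio, asr(audio)) for audio in audios]
    return from_transcribed_speech(pseudo_labeled, generate_qa, keep, source="pseudo")
```

Passing the components in as callables keeps the sketch independent of any particular TTS, ASR, or LLM library, and it makes explicit that the third route differs from the second only in where the transcript comes from.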

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models
