Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data

Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen

2024-09-17

Abstract: While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to the monolingual scenario, with limited exploration in multilingual and code-switched (CS) contexts. Additionally, speech generation and recognition tasks are often handled separately, as in VALL-E and Qwen-Audio. In this paper, we propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. The experimental results demonstrate that our model outperforms other baselines with a comparable data scale. Furthermore, our data construction approach not only equips LLMs with CS speech synthesis capability with comparable speaker consistency and similarity to any given speaker, but also improves the performance of LLMs in multilingual speech generation and recognition tasks.

Audio and Speech Processing, Computation and Language, Sound

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Unification of Multilingual Speech Generation and Recognition Tasks**: LLM-based speech systems typically handle generation and recognition separately (for example, VALL-E and Qwen-Audio) and are mostly limited to monolingual scenarios. The paper proposes the MultiLingual MultiTask (MLMT) model, which integrates multilingual speech generation and recognition within a single LLM.
2. **Code-Switched (CS) Synthesis Without CS Data**: The paper develops a data construction method that splits words from different languages and concatenates them into code-switched training samples. This equips the model with CS speech synthesis ability without relying on CS data.
3. **Improving Performance in Multilingual Speech Tasks**: Experimental results show that MLMT outperforms baseline models trained at a comparable data scale. The data construction strategy also gives the model CS speech synthesis with comparable speaker consistency and similarity to any given speaker, and it further improves performance on multilingual speech generation and recognition tasks.
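The split-and-concatenate idea behind the constructed code-switched data can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each monolingual utterance is already aligned at the word level to a sequence of discrete speech tokens, and the names `Word` and `build_cs_sample`, the prefix/suffix cutting rule, and the random cut points are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str         # written form of the word
    lang: str         # language tag, e.g. "en" or "zh"
    units: List[int]  # speech tokens aligned to this word (assumed to be given)


def build_cs_sample(
    utt_a: List[Word], utt_b: List[Word], rng: random.Random
) -> Tuple[str, List[int], List[str]]:
    """Join a prefix of utt_a with a suffix of utt_b into one code-switched sample."""
    if not utt_a or not utt_b:
        raise ValueError("both utterances must contain at least one word")
    cut_a = rng.randint(1, len(utt_a))      # keep at least one word from utt_a
    cut_b = rng.randint(0, len(utt_b) - 1)  # keep at least one word from utt_b
    words = utt_a[:cut_a] + utt_b[cut_b:]
    text = " ".join(w.text for w in words)
    units = [u for w in words for u in w.units]  # concatenate the aligned speech tokens
    langs = [w.lang for w in words]
    return text, units, langs


if __name__ == "__main__":
    en = [Word("good", "en", [1, 2]), Word("morning", "en", [3, 4, 5])]
    zh = [Word("你", "zh", [10]), Word("好", "zh", [11, 12])]
    print(build_cs_sample(en, zh, random.Random(0)))
```

Each constructed sample pairs mixed-language text with the matching concatenated speech tokens, so it could serve as a training pair for either synthesis (text to tokens) or recognition (tokens to text). How the paper actually selects split points and handles speaker identity across the joined segments is not specified in this summary.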

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback

Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Tuning Large language model for End-to-end Speech Translation

Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners

Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Multi-level Contrastive Learning for Cross-lingual Spoken Language Understanding

LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Learning Language and Speaker Information for Code-Switch Speech Synthesis with Limited Data

Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding

MM-LLMs: Recent Advances in MultiModal Large Language Models

Learning Cross-lingual Information with Multilingual BLSTM for Speech Synthesis of Low-resource Languages

Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis