Abstract: Because dialogues naturally occur across multiple modalities (text, audio, vision), textual response generation should draw on multi-modal context rather than text alone. However, most existing studies ignore the rich information carried by other modalities, such as audio. To investigate the importance of acoustic context, we explore a multi-modal dialogue scenario with temporally aligned text and audio sequences, in which an assumed system generates textual responses; we call this the RGMD task. To this end, we construct a new multi-modal dataset for this task based on TV shows, containing 84.9K utterances. To address the limits that context places on response diversity and to model modality interactions in RGMD, we adopt a split pre-generation (SPG) strategy and a cross-modal contrastive learning (CCL) strategy in multi-modal pre-training. With SPG, we can obtain many diverse responses without being restricted by a long history of mixed multi-modal contexts. With CCL, we can capture the interactions between text and audio. Extensive experiments demonstrate that our BART-based approach consistently outperforms the state-of-the-art textual approach DP by 4.17%, 8.96%, 2.43%, 1.04% and 7.54% on BLEU, DIST, ROUGE, METEOR and NIST, respectively. Moreover, our GPT-based approach outperforms the state-of-the-art multi-modal approach RLM by 6.79%, 9.25%, 7.49%, 9.31% and 13.75% on BLEU, DIST, ROUGE, METEOR and NIST, respectively. Finally, in-depth analyses show the necessity of audio for response generation and further verify the effectiveness of our approach.
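The abstract does not spell out the CCL objective, so the following is only an illustrative sketch of one common way to contrast aligned text and audio representations: a symmetric InfoNCE-style loss. It is an assumption, not the authors' formulation. The function name `cross_modal_contrastive_loss`, the `temperature` value, and the toy embeddings are all hypothetical. Row i of the text embeddings is assumed to be aligned with row i of the audio embeddings.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diag_cross_entropy(sim):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def cross_modal_contrastive_loss(text_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: text_emb[i] and audio_emb[i] come from the same utterance
    (a positive pair); every other pairing in the batch is a negative.
    """
    t = [_normalize(v) for v in text_emb]
    a = [_normalize(v) for v in audio_emb]
    # Cosine similarity matrix, scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(ti, aj)) / temperature for aj in a]
           for ti in t]
    sim_t = [list(col) for col in zip(*sim)]  # audio-to-text direction
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_t))


# Toy usage: aligned pairs should give a much lower loss than shuffled pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = cross_modal_contrastive_loss(text, audio_aligned)
loss_shuffled = cross_modal_contrastive_loss(text, audio_shuffled)
```

In this sketch, minimizing the loss pulls each text embedding toward the audio embedding of the same utterance and pushes it away from the other utterances in the batch, which is the general idea behind capturing text-audio interactions contrastively.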

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents

Speak While You Think: Streaming Speech Synthesis During Text Generation

Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation

Human Latency Conversational Turns for Spoken Avatar Systems

A Full-duplex Speech Dialogue Scheme Based On Large Language Models

SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM

Language Model Can Listen While Speaking

Text-Free Prosody-Aware Generative Spoken Language Modeling

Long-Form Speech Generation with Spoken Language Models

Efficient Parallel Audio Generation using Group Masked Language Modeling

Response Generation in Multi-Modal Dialogues with Split Pre-Generation and Cross-Modal Contrasting

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

Predictive Simultaneous Interpretation: Harnessing Large Language Models for Democratizing Real-Time Multilingual Communication

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

AudioPaLM: A Large Language Model That Can Speak and Listen

Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation