AudioPaLM: A Large Language Model That Can Speak and Listen

Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
2023-06-22
Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing, Machine Learning
What problem does this paper attempt to address?
The paper addresses the following issues:

1. **Unified Speech and Text Processing**: The paper introduces AudioPaLM, a multimodal generative model that handles both speech and text. It fuses a text-based language model (PaLM-2) with a speech-based language model (AudioLM), so that speech and text can be understood and generated within a single architecture.
2. **Improving Speech Processing Performance**: AudioPaLM is initialized with the weights of a text-only large language model. This lets the speech tasks benefit from the larger quantity of text data used in pretraining and from the linguistic knowledge that text language models carry.
3. **Zero-Shot Speech Translation**: For many languages, AudioPaLM can perform speech-to-text translation for input/target language combinations not seen during training, which indicates a degree of generalization.
4. **Voice Conversion Functionality**: AudioPaLM can transfer a voice across languages based on a short spoken prompt, preserving paralinguistic information such as speaker identity and intonation during speech-to-speech translation.
5. **Multi-Task Joint Training**: The paper shows how to train the model on a mixture of tasks, including automatic speech recognition (ASR), speech-to-text translation (AST), and speech-to-speech translation (S2ST), and shows that this approach is effective.
6. **Cross-Modal Integration**: Merging the speech and text vocabularies into a unified vocabulary lets the model handle different types of inputs and outputs within the same framework, enabling more flexible task execution.

In summary, the paper builds a single multimodal model that handles both speech and text, and improves its performance on speech tasks chiefly through text-pretrained initialization, a unified vocabulary, and mixed-task training.
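The unified-vocabulary idea in item 6, combined with the text-pretrained initialization in item 2, can be illustrated with a minimal sketch. It assumes that audio is already represented as discrete token ids and that the unified vocabulary is formed by appending audio tokens after the text tokens; the sizes, the function names (`extend_embeddings`, `audio_to_unified`), and the initialization scale are hypothetical choices for illustration, not values or code taken from the paper.

```python
import random

# Hypothetical toy sizes; real vocabularies and embedding widths are far larger.
TEXT_VOCAB_SIZE = 8
AUDIO_VOCAB_SIZE = 4
EMBED_DIM = 3


def extend_embeddings(text_embeddings, num_audio_tokens, seed=0):
    """Keep the pretrained text rows unchanged and append new rows for audio tokens.

    Assumption: new rows get small random values; the paper's exact
    initialization is not stated in this summary.
    """
    rng = random.Random(seed)
    dim = len(text_embeddings[0])
    audio_rows = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(num_audio_tokens)
    ]
    return [row[:] for row in text_embeddings] + audio_rows


def audio_to_unified(audio_ids, text_vocab_size):
    """Shift audio token ids so they sit after the text ids in one shared id space."""
    return [text_vocab_size + a for a in audio_ids]


def is_audio_token(token_id, text_vocab_size):
    """In this layout, any id at or beyond the text vocabulary size is an audio token."""
    return token_id >= text_vocab_size


# Stand-in for a pretrained text embedding table.
rng = random.Random(1)
text_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(TEXT_VOCAB_SIZE)
]

unified_embeddings = extend_embeddings(text_embeddings, AUDIO_VOCAB_SIZE)

# A mixed sequence: two text tokens followed by two audio tokens.
sequence = [1, 5] + audio_to_unified([0, 3], TEXT_VOCAB_SIZE)
vectors = [unified_embeddings[t] for t in sequence]
```

With a single id space, one decoder can read and emit either modality in the same sequence, and the text rows carry over whatever was learned during text pretraining.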