Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
2024-09-19
Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: https://x-codec-audio.github.io, Code: https://github.com/zhenye234/xcodec).
Audio and Speech Processing, Artificial Intelligence, Computation and Language, Sound
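The abstract measures content accuracy with word error rate (WER). As a rough illustration only, WER is commonly computed as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences below are hypothetical and are not taken from the paper or its code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A skipped word ("sat") counts as one deletion out of six reference words.
skipped = word_error_rate("the cat sat on the mat", "the cat on the mat")
```

In an evaluation of synthesized speech, the hypothesis would typically come from running a speech recognizer on the generated audio; that step is outside this sketch.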
What problem does this paper attempt to address?
The problem this paper addresses is that the audio codecs used in current audio large language models (audio LLMs) do not reliably preserve semantic integrity in generated audio. Existing codecs such as EnCodec were designed for audio compression rather than for audio LLM tasks, which leads to content inaccuracies and a high word error rate (WER) in the generated audio, for example word skipping and errors.

### Main Problems

1. **Loss of Semantic Information**: Existing codecs focus mainly on acoustic reconstruction and ignore semantic information, so the generated audio is not semantically accurate enough.
2. **High Word Error Rate (WER)**: Methods that generate acoustic tokens conditioned on text transcriptions (such as VALL-E) often produce inaccurate content because the acoustic tokens are semantically misinterpreted, resulting in a high WER.
3. **Complexity and Compatibility**: Some methods that attempt to separate speech content from timbre (such as SpeechTokenizer) may not be compatible with other audio LLM architectures, especially when unified token processing is required.

### Solutions

To address these problems, the paper proposes a new method, X-Codec. X-Codec enhances the semantic capability of the codec in two ways:

- **Introducing a Semantic Encoder**: Before the Residual Vector Quantization (RVQ) stage, a pre-trained semantic encoder extracts semantic features that are incorporated into the codec.
- **Semantic Reconstruction Loss**: After RVQ, a semantic reconstruction loss is applied so that the quantized representation better retains semantic information.

### Experimental Results

The paper evaluates X-Codec on text-to-speech (TTS), music continuation, and text-to-sound tasks. The reported results show that X-Codec significantly reduces WER and outperforms existing codecs on several evaluation metrics, including Sim-O and UTMOS.

### Conclusion

By introducing X-Codec, the paper addresses the loss of semantic information in existing audio codecs and improves the quality and content accuracy of generated audio.
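The two mechanisms listed under Solutions can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the features and codebooks are random, the acoustic and semantic features are fused by simple concatenation, the "semantic decoder" is just a slice of the quantized vector, and the loss is a mean squared error. All names (`rvq_encode`, `nearest`, `mse`, the dimension constants) are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return best, codebook[best]


def rvq_encode(vec, codebooks):
    """Residual vector quantization: each stage quantizes what earlier stages left over."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx, word = nearest(codebook, residual)
        codes.append(idx)
        quantized = [q + w for q, w in zip(quantized, word)]
        residual = [r - w for r, w in zip(residual, word)]
    return codes, quantized


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy sizes and random data; a real codec would use learned encoders and codebooks.
rng = random.Random(0)
ACOUSTIC_DIM, SEMANTIC_DIM, NUM_STAGES, CODEBOOK_SIZE = 4, 4, 3, 16
dim = ACOUSTIC_DIM + SEMANTIC_DIM
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(dim)] for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_STAGES)
]

acoustic = [rng.gauss(0, 1) for _ in range(ACOUSTIC_DIM)]  # stand-in for acoustic encoder output
semantic = [rng.gauss(0, 1) for _ in range(SEMANTIC_DIM)]  # stand-in for pre-trained semantic encoder output

# Step 1 (before RVQ): fuse semantic features with acoustic features (assumed: concatenation).
fused = acoustic + semantic
codes, quantized = rvq_encode(fused, codebooks)

# Step 2 (after RVQ): compare recovered semantic features with the originals.
# Assumed: the semantic "decoder" is a slice, and the loss is MSE.
semantic_hat = quantized[ACOUSTIC_DIM:]
semantic_loss = mse(semantic_hat, semantic)
```

During training, a loss of this kind would be added to the usual acoustic reconstruction objective so that the discrete codes are pushed to keep semantic content; here nothing is trained, so `semantic_loss` only shows where the term would be computed.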