Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: https://x-codec-audio.github.io Code: https://github.com/zhenye234/xcodec).

What problem does this paper attempt to address?

The paper addresses the inability of the audio codecs used in current Audio Large Language Models (Audio LLMs) to maintain semantic integrity in generated audio. Existing codecs such as EnCodec were designed for audio compression rather than for Audio LLM tasks. As a result, the generated audio suffers from content inaccuracies and a high Word Error Rate (WER), including skipped and incorrect words.

### Main Problems

1. **Loss of Semantic Information**: Existing codecs focus on acoustic reconstruction and ignore semantic information, so the generated audio is not semantically accurate enough.
2. **High Word Error Rate (WER)**: Methods that generate acoustic tokens conditioned on text transcriptions (such as VALL-E) often produce inaccurate content because the acoustic tokens are semantically misinterpreted, which raises the WER.
3. **Complexity and Compatibility**: Some methods that attempt to separate speech content from timbre (such as SpeechTokenizer) may not be compatible with other Audio LLM architectures, especially when unified token processing is required.

### Solutions

To solve these problems, the paper proposes a new method, X-Codec. X-Codec enhances the semantic capabilities of the codec in two ways:

- **Introducing a Semantic Encoder**: Before the Residual Vector Quantization (RVQ) stage, a pre-trained semantic encoder extracts semantic features that are incorporated into the codec.
- **Semantic Reconstruction Loss**: After RVQ, a semantic reconstruction loss is introduced so that the quantized representation better retains semantic information.

### Experimental Results

Through experiments on Text-to-Speech (TTS), music continuation, and text-to-sound synthesis, the paper demonstrates the effectiveness of X-Codec. The results show that X-Codec significantly reduces the WER and outperforms existing codecs on several evaluation metrics, including Sim-O and UTMOS.

### Conclusion

By introducing X-Codec, the paper addresses the loss of semantic information in existing audio codecs and improves the quality and accuracy of generated audio.
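The two mechanisms listed under Solutions can be illustrated with a toy sketch. This is not the authors' implementation: the feature sizes, the random codebooks, the use of concatenation to fuse acoustic and semantic features, and the slice-based "semantic decoder" are all assumptions made only to show where the semantic features enter (before RVQ) and where the semantic reconstruction loss is measured (after RVQ).

```python
import random


def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    idx = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )
    return idx, codebook[idx]


def rvq_encode(vec, codebooks):
    """Residual vector quantization: each stage quantizes what the previous stages left over."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx, word = nearest(codebook, residual)
        codes.append(idx)
        quantized = [q + w for q, w in zip(quantized, word)]
        residual = [r - w for r, w in zip(residual, word)]
    return codes, quantized


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode_frame(acoustic_feat, semantic_feat, codebooks):
    """Fuse features before RVQ, then score semantic reconstruction after RVQ."""
    # Assumption: fusion by simple concatenation.
    fused = list(acoustic_feat) + list(semantic_feat)
    codes, quantized = rvq_encode(fused, codebooks)
    # Assumption: a trivial "semantic decoder" that reads back the semantic slice.
    semantic_hat = quantized[len(acoustic_feat):]
    semantic_loss = mse(semantic_hat, semantic_feat)
    return codes, quantized, semantic_loss


# Hypothetical sizes, chosen only to keep the example small.
DIM_ACOUSTIC, DIM_SEMANTIC, N_STAGES, CODEBOOK_SIZE = 4, 4, 3, 16
rng = random.Random(0)
codebooks = [
    [
        [rng.gauss(0.0, 1.0 / (stage + 1)) for _ in range(DIM_ACOUSTIC + DIM_SEMANTIC)]
        for _ in range(CODEBOOK_SIZE)
    ]
    for stage in range(N_STAGES)
]
acoustic = [rng.gauss(0.0, 1.0) for _ in range(DIM_ACOUSTIC)]
semantic = [rng.gauss(0.0, 1.0) for _ in range(DIM_SEMANTIC)]

codes, quantized, semantic_loss = encode_frame(acoustic, semantic, codebooks)
print("codes per RVQ stage:", codes)
print("semantic reconstruction loss:", semantic_loss)
```

In a trained codec the codebooks, the fusion step, and the semantic decoder would be learned jointly, and the semantic loss would be one term in the training objective alongside the acoustic reconstruction losses. Here it is only computed, not optimized.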

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

Codec-SUPERB: An In-Depth Analysis of Sound Codec Models

HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

Towards audio language modeling -- an overview

Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

SpatialCodec: Neural Spatial Speech Coding

Learning Source Disentanglement in Neural Audio Codec

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer