Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
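
As a quick check of the speed figure quoted in the abstract, the snippet below computes how much faster than real time the reported generation is. It uses only the two numbers stated above (30 seconds of audio, 0.5 seconds of generation time); the variable names are illustrative.

```python
audio_seconds = 30.0       # duration of the generated audio (from the abstract)
generation_seconds = 0.5   # generation time on a TPU-v4 (from the abstract)

# Ratio of audio duration to generation time: values above 1 are faster than real time.
speedup_over_real_time = audio_seconds / generation_seconds
print(f"{speedup_over_real_time:.0f}x faster than real time")  # prints "60x faster than real time"
```

Note that this ratio is relative to real-time playback. The "two orders of magnitude" claim in the abstract is a separate comparison against AudioLM's autoregressive decoding.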

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "SoundStorm: Efficient Parallel Audio Generation" addresses efficient, high-quality audio generation. Specifically, it focuses on the following aspects:

1. **Long Audio Sequence Generation**: Autoregressive models are computationally expensive and slow when generating long audio sequences. The paper proposes a non-autoregressive parallel decoding scheme that significantly improves generation speed while maintaining audio quality.
2. **Audio Quality and Consistency**: Maintaining consistent voice and acoustic conditions across long audio is a challenge. The paper uses bidirectional attention and a confidence-based parallel decoding scheme to improve the quality and consistency of the generated audio.
3. **Efficient Modeling**: Neural audio codecs (such as SoundStream) produce multi-level tokens with a specific hierarchical structure. The paper proposes a model architecture adapted to this hierarchy, so that long audio sequences can be generated efficiently.

### Main Contributions

- **Model Architecture**: The SoundStorm model uses bidirectional attention and a Conformer network to efficiently predict masked audio tokens.
- **Parallel Decoding**: A confidence-based parallel decoding scheme generates multiple tokens in parallel over several iterations, significantly improving generation speed.
- **Performance Improvement**: Experimental results show that SoundStorm takes only 0.5 seconds to generate 30 seconds of audio, two orders of magnitude faster than the autoregressive generation method of AudioLM, while producing audio of comparable or better quality.
- **Application Extension**: SoundStorm is applied to dialogue synthesis, generating natural, high-quality dialogue segments from a transcript and speaker prompts.

### Experimental Validation

- **Speech Intelligibility**: Word error rate (WER) and character error rate (CER) measured on the generated audio show that SoundStorm performs better than AudioLM on audio of different lengths.
- **Speaker Preservation**: Cosine similarity between the generated audio and the prompt audio shows that SoundStorm preserves speaker identity well.
- **Acoustic Consistency**: A model trained to evaluate acoustic consistency over time shows that SoundStorm maintains high acoustic consistency in long audio generation.
- **Audio Quality**: A MOS estimator is used to assess the perceptual quality of the generated audio. SoundStorm's audio quality is comparable to that of AudioLM, which in turn has been shown to be comparable to real audio.

In summary, the paper addresses efficient, high-quality audio generation by proposing the SoundStorm model, and validates its effectiveness through multiple experiments.
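
The confidence-based parallel decoding described in the summary can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a single token level, a cosine unmasking schedule, and a hypothetical `toy_predict` function that stands in for the trained network by returning random candidates and confidences. The point is only to show how several positions are committed per iteration, most confident first.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for the trained network.

    Returns one (candidate_token, confidence) pair per position. A real model
    would condition on the semantic tokens and on the already-decoded tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, vocab_size, num_iters, seed=0):
    """Fill `length` masked positions in at most `num_iters` parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, num_iters + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        predictions = toy_predict(tokens, vocab_size, rng)
        # Cosine schedule (an assumption): number of positions that should
        # remain masked after this step; forced to 0 on the final step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / num_iters))
        if step == num_iters:
            keep_masked = 0
        num_to_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident candidates; the rest stay masked.
        ranked = sorted(masked, key=lambda i: predictions[i][1], reverse=True)
        for i in ranked[:num_to_commit]:
            tokens[i] = predictions[i][0]
    return tokens
```

For example, `parallel_decode(16, 1024, 4)` fills 16 positions in four iterations, where token-by-token autoregressive decoding would need 16 steps. The vocabulary size of 1024 is an arbitrary value chosen for illustration.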

SoundStorm: Efficient Parallel Audio Generation

Efficient Parallel Audio Generation Using Group Masked Language Modeling

AudioLM: a Language Modeling Approach to Audio Generation

Audiobox: Unified Audio Generation with Natural Language Prompts

SoundStream: An End-to-End Neural Audio Codec

AudioLCM: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps

Efficient Neural Music Generation

Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Fast Timing-Conditioned Latent Audio Diffusion

StemGen: A music generation model that listens

Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation

UniAudio: An Audio Foundation Model Toward Universal Audio Generation

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

SpeedySpeech: Efficient Neural Speech Synthesis

Low-latency Speech Enhancement via Speech Token Generation

Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

Efficient Autoregressive Audio Modeling via Next-Scale Prediction

Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls