SoundStorm: Efficient Parallel Audio Generation

Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
2023-05-17
Abstract: We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper "SoundStorm: Efficient Parallel Audio Generation" addresses the problem of generating high-quality audio efficiently. Specifically, it focuses on the following aspects:

1. **Long Audio Sequence Generation**: Autoregressive models decode one token at a time, so generating long audio sequences is computationally expensive and slow. The paper proposes a non-autoregressive parallel decoding scheme that significantly improves generation speed while maintaining audio quality.
2. **Audio Quality and Consistency**: Keeping voice and acoustic conditions consistent across a long generated segment is difficult. The paper combines bidirectional attention with confidence-based parallel decoding to improve the quality and consistency of the generated audio.
3. **Efficient Modeling**: Neural audio codecs (such as SoundStream) produce multi-level tokens with a specific hierarchical structure. The paper proposes a model architecture adapted to this hierarchy, which makes generating long audio sequences efficient.

### Main Contributions

- **Model Architecture**: SoundStorm uses bidirectional attention and a Conformer network to predict masked audio tokens efficiently.
- **Parallel Decoding**: A confidence-based parallel decoding scheme generates multiple tokens in parallel over several iterations, significantly improving generation speed.
- **Performance Improvement**: SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4, two orders of magnitude faster than the autoregressive generation of AudioLM, while producing audio of the same quality with higher consistency in voice and acoustic conditions.
- **Application Extension**: SoundStorm is applied to dialogue synthesis, generating natural, high-quality dialogue segments from a transcript annotated with speaker turns and a short prompt with the speakers' voices.

### Experimental Validation

- **Speech Intelligibility**: The word error rate (WER) and character error rate (CER) measured on the generated audio show that SoundStorm performs better than AudioLM across audio of different lengths.
- **Speaker Preservation**: The cosine similarity between the generated audio and the prompt audio shows that SoundStorm preserves speaker identity well.
- **Acoustic Consistency**: A model trained to evaluate acoustic consistency over time shows that SoundStorm maintains high acoustic consistency when generating long audio.
- **Audio Quality**: A MOS estimator used to assess perceptual quality shows that SoundStorm's audio quality is comparable to that of AudioLM, which has itself been shown to be comparable to real audio.

In summary, the paper addresses efficient, high-quality audio generation by proposing the SoundStorm model, and it validates the model's effectiveness through multiple experiments.
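
To make the confidence-based parallel decoding idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's implementation: `toy_model` is a hypothetical stand-in that returns random logits in place of the bidirectional Conformer, the `MASK` sentinel and the function names are invented for this example, the cosine unmasking schedule is an assumption (the summary above does not specify a schedule), and the sketch treats a single flat token sequence rather than the codec's hierarchical multi-level tokens.

```python
import numpy as np

MASK = -1  # hypothetical sentinel meaning "token not decoded yet"


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional network.

    A real model would condition on all committed tokens (and on the
    semantic tokens); this one just returns random logits per position.
    """
    return rng.normal(size=(tokens.shape[0], vocab_size))


def parallel_decode(seq_len, vocab_size, num_iters, seed=0):
    """Fill a fully masked sequence in at most `num_iters` parallel steps."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for it in range(num_iters):
        masked = tokens == MASK
        if not masked.any():
            break

        # Predict a distribution for every position in one forward pass.
        logits = toy_model(tokens, vocab_size, rng)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)

        candidates = probs.argmax(axis=-1)  # greedy choice per position
        confidence = probs.max(axis=-1)     # probability of that choice
        confidence[~masked] = -np.inf       # committed positions stay fixed

        # Assumed cosine schedule: fraction of positions left masked
        # after this iteration; it reaches zero on the final iteration.
        frac_masked = np.cos(np.pi / 2 * (it + 1) / num_iters)
        target_masked = int(np.floor(seq_len * frac_masked))
        num_commit = max(1, int(masked.sum()) - target_masked)

        # Commit the most confident masked positions; the rest are
        # re-predicted in the next iteration with more context.
        commit = np.argsort(-confidence)[:num_commit]
        tokens[commit] = candidates[commit]

    return tokens


decoded = parallel_decode(seq_len=64, vocab_size=16, num_iters=8)
```

The point of the sketch is the cost model: the number of forward passes is bounded by `num_iters` rather than by the sequence length, which is where a parallel decoder gains its speed over token-by-token autoregressive generation.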