Efficient Parallel Audio Generation Using Group Masked Language Modeling

Myeonghun Jeong, Minchan Kim, Joun Yeop Lee, Nam Soo Kim
DOI: https://doi.org/10.1109/lsp.2024.3381910
2024-04-05
IEEE Signal Processing Letters
Abstract: We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, accelerates inference compared to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and to improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation.
engineering, electrical & electronic