EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

Shijia Liao, Shiyi Lan, Arun George Zachariah
2024-01-31
Abstract: The advent of Large Models marks a new era in machine learning, significantly outperforming smaller models by leveraging vast datasets to capture and synthesize complex patterns. Despite these advancements, the exploration of scaling, especially in the audio generation domain, remains limited: previous efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain and suffered from both spectral discontinuities and blurriness in the high-frequency domain, alongside a lack of robustness against out-of-domain data. These limitations restrict the applicability of models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and in robustness on out-of-domain data. It enables the generation of HiFi audio by employing an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and by expanding the model to approximately 200 million parameters. Demonstrations of our work are available at
Sound, Artificial Intelligence, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper targets key limitations of current audio generation models in high-fidelity (HiFi) audio generation. Specifically:

1. **Spectral discontinuity and high-frequency blurring**: Existing models often produce spectral discontinuities and blurred high-frequency content, which degrades the quality of the generated audio.
2. **Insufficient robustness to out-of-domain data**: Existing models perform poorly on out-of-domain data, especially in tasks such as music and singing synthesis.
3. **Limited dataset and model scale**: Most existing models are trained on relatively small datasets and have relatively few parameters, which limits their performance.
4. **Lack of effective objective evaluation metrics**: Existing metrics fail to detect some subtle but perceptually significant artifacts, such as short-term spectral discontinuities.

To address these issues, the paper introduces a new Generative Adversarial Network (GAN) model, Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN). EVA-GAN improves the quality and robustness of audio generation through the following enhancements:

1. **Expanded dataset and model scale**: Uses a dataset of 36,000 hours of 44.1kHz audio and scales the model to approximately 200 million parameters.
2. **Context-Aware Module (CAM)**: This module significantly improves the model's performance without requiring additional computational resources.
3. **Improved training process**: Uses longer context windows, a loss balancer, gradient checkpointing, and improved activation functions to enhance training stability and efficiency.
4. **Human-In-The-Loop artifact measurement toolkit**: An evaluation toolkit that detects and assesses artifacts in generated audio more accurately, keeping the assessment consistent with human subjective perception.

With these improvements, EVA-GAN achieves significant gains in high-fidelity audio generation, particularly in spectral continuity and high-frequency detail.
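
The summary names a loss balancer among the training enhancements but does not describe how it works. As a rough illustration only, the sketch below assumes a gradient-norm balancing scheme: each loss's gradient with respect to the generator output is rescaled so that it contributes a fixed share of the total gradient norm, regardless of the raw magnitude of that loss. The function name `balance_gradients`, the loss names `mel` and `adv`, and the weights are hypothetical and are not taken from the paper.

```python
import math


def balance_gradients(grads, weights, total_norm=1.0, eps=1e-12):
    """Combine per-loss gradients so each loss gets a fixed share of the norm.

    grads:   dict of loss name -> gradient w.r.t. the generator output
             (plain lists of floats here; tensors in a real training loop)
    weights: dict of loss name -> relative importance (assumed values)
    """
    weight_sum = sum(weights.values())
    size = len(next(iter(grads.values())))
    combined = [0.0] * size
    for name, grad in grads.items():
        # L2 norm of this loss's gradient.
        norm = math.sqrt(sum(g * g for g in grad))
        # Normalize the gradient, then scale it to its weighted share.
        scale = total_norm * weights[name] / weight_sum / (norm + eps)
        for i, g in enumerate(grad):
            combined[i] += scale * g
    return combined


# Hypothetical example: the "mel" gradient is far larger than the "adv" one,
# yet after balancing their norms follow the 3:1 weight ratio.
grads = {"mel": [300.0, 400.0], "adv": [0.0, 0.002]}
weights = {"mel": 3.0, "adv": 1.0}
print(balance_gradients(grads, weights))  # approximately [0.45, 0.85]
```

Without balancing, the `mel` term here would dominate the update by several orders of magnitude. With it, the weights directly express how much each loss should steer the generator, which is one plausible way such a component could improve training stability.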