Abstract: The advent of large models marks a new era in machine learning, with such models significantly outperforming smaller ones by leveraging vast datasets to capture and synthesize complex patterns. Despite these advancements, the exploration of scaling, especially in the audio generation domain, remains limited: previous efforts did not extend into the high-fidelity (HiFi) 44.1kHz domain and suffered from both spectral discontinuities and blurriness in the high-frequency domain, alongside a lack of robustness against out-of-domain data. These limitations restrict the applicability of models to diverse use cases, including music and singing generation. Our work introduces Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN), which yields significant improvements over the previous state of the art in spectral and high-frequency reconstruction and in robustness on out-of-domain data. It enables the generation of HiFi audio by employing an extensive dataset of 36,000 hours of 44.1kHz audio, a context-aware module, and a Human-In-The-Loop artifact measurement toolkit, and by expanding the model to approximately 200 million parameters. Demonstrations of our work are available at
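To give a rough sense of the scale quoted in the abstract, the figures can be turned into back-of-the-envelope numbers. The sketch below is only illustrative: it assumes mono audio and 32-bit floating-point parameters, neither of which the abstract states, and the function names are made up for this example.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the abstract.
# Assumptions (not stated in the abstract): mono audio, 4 bytes per parameter.

SAMPLE_RATE_HZ = 44_100      # "44.1kHz"
DATASET_HOURS = 36_000       # "36,000 hours"
PARAM_COUNT = 200_000_000    # "approximately 200 million parameters"


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2


def total_samples(hours: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Number of audio samples in `hours` of audio (mono by default)."""
    return hours * 3600 * sample_rate_hz * channels


def param_memory_gb(param_count: int, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return param_count * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Nyquist limit: {nyquist_hz(SAMPLE_RATE_HZ):.0f} Hz")
    print(f"Dataset size:  {total_samples(DATASET_HOURS, SAMPLE_RATE_HZ):,} samples")
    print(f"Weights:       {param_memory_gb(PARAM_COUNT):.1f} GB at 4 bytes/param")
```

Under these assumptions, 44.1kHz audio covers content up to 22.05kHz, which is why high-frequency blurriness matters at this sample rate, and the dataset amounts to roughly 5.7 trillion samples.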

What problem does this paper attempt to address?

The paper addresses key limitations of current audio generation models in high-fidelity (HiFi) audio generation. Specifically:

1. **Spectral discontinuity and high-frequency blurring**: Existing audio generation models often produce spectral discontinuities and blurred detail in the high-frequency range, which degrades the quality of the generated audio.
2. **Insufficient robustness to out-of-domain data**: Existing models perform poorly on out-of-domain data, especially in tasks such as music and singing synthesis.
3. **Limited dataset and model scale**: Most existing audio generation models are trained on relatively small datasets and have a limited number of parameters, which caps the performance they can reach.
4. **Lack of effective objective evaluation metrics**: Existing evaluation metrics cannot reliably detect some subtle but perceptually significant artifacts, such as short-term spectral discontinuities.

To address these issues, the paper introduces a new Generative Adversarial Network (GAN) model, Enhanced Various Audio Generation via Scalable Generative Adversarial Networks (EVA-GAN). EVA-GAN improves the quality and robustness of audio generation through the following enhancements:

1. **Expanded dataset and model scale**: Uses a large dataset containing 36,000 hours of 44.1kHz audio and expands the model to approximately 200 million parameters.
2. **Context-Aware Module (CAM)**: This module significantly improves the model's performance without requiring additional computational resources.
3. **Revised training process**: Uses longer context windows, a loss balancer, gradient checkpointing, and improved activation functions to enhance training stability and efficiency.
4. **Human-In-The-Loop artifact measurement toolkit**: A new evaluation tool that detects and assesses artifacts in generated audio more accurately, keeping the measurements consistent with human subjective perception.

With these improvements, EVA-GAN achieves significant gains in high-fidelity audio generation tasks, particularly in spectral continuity and high-frequency detail.
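The summary names a loss balancer but does not say how it works. One widely used formulation, introduced with the EnCodec audio codec, rescales the gradient of each loss by a running average of its norm, so that each loss contributes a fixed share of the total gradient regardless of its raw magnitude. The sketch below assumes that formulation; it is not taken from the EVA-GAN paper, and the class name, loss names (`mel`, `adv`), and weights are hypothetical.

```python
import math


class LossBalancer:
    """EnCodec-style gradient balancer (illustrative, pure Python).

    Each loss gradient g_i is rescaled to
        total_norm * w_i / sum(w) * g_i / avg_norm_i
    where avg_norm_i is an exponential moving average of ||g_i||.
    """

    def __init__(self, weights, total_norm=1.0, ema_decay=0.999, eps=1e-12):
        self.weights = dict(weights)    # hypothetical per-loss weights
        self.total_norm = total_norm    # target norm shared among the losses
        self.ema_decay = ema_decay
        self.eps = eps
        self.avg_norms = {}             # running average norm per loss

    def _update_norm(self, name, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        prev = self.avg_norms.get(name)
        if prev is None:
            self.avg_norms[name] = norm  # first step: no history yet
        else:
            d = self.ema_decay
            self.avg_norms[name] = d * prev + (1 - d) * norm
        return self.avg_norms[name]

    def combine(self, grads):
        """Return the balanced sum of per-loss gradients (lists of floats)."""
        weight_sum = sum(self.weights[name] for name in grads)
        size = len(next(iter(grads.values())))
        combined = [0.0] * size
        for name, grad in grads.items():
            avg_norm = self._update_norm(name, grad)
            share = self.weights[name] / weight_sum
            scale = self.total_norm * share / (avg_norm + self.eps)
            for i, g in enumerate(grad):
                combined[i] += scale * g
        return combined


if __name__ == "__main__":
    balancer = LossBalancer({"mel": 3.0, "adv": 1.0})
    # The raw gradients differ by five orders of magnitude in norm.
    print(balancer.combine({"mel": [300.0, 400.0], "adv": [0.0, 0.002]}))
```

In this toy call the `mel` gradient has norm 500 and the `adv` gradient has norm 0.002, yet after balancing they contribute in the 3:1 ratio set by the weights. In a real training loop the gradients would be tensors taken with respect to the generator output, not Python lists.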
