Abstract: Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using the generated audio as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.

What problem does this paper attempt to address?

The problem this paper addresses is **evaluating whether synthetic audio can assist audio recognition and speech modeling**. Specifically, the authors explore two main issues:

1. **Quality assessment of synthetic audio**:
   - Conventional quality assessment for audio generation relies on distance metrics such as Frechet Audio Distance (FAD), but these metrics do not fully capture how well the generated audio matches real audio.
   - The paper proposes a different assessment method: judging the quality of synthetic audio by how effective it is as training data.
2. **Application of synthetic audio in audio recognition and speech modeling**:
   - Investigate whether synthetic audio can serve as a data-augmentation resource that improves audio recognition and speech modeling performance.
   - Explore how synthetic audio performs in zero-shot learning, and how the number of generated samples and the diversity of prompts affect zero-shot audio recognition.

### Specific research content

- **Selection of audio generation models**: The paper uses three popular audio generation models, AUDIOGEN, AudioLDM 2, and MusicGen, which are used to generate general sounds, human action sounds, and music respectively.
- **Experimental design**:
  - **Zero-shot audio recognition**: Compare the zero-shot performance of synthetic audio produced by the different generation models.
  - **Mixed training**: Mix synthetic audio with real audio for training and study the impact on audio recognition performance.
  - **Data augmentation**: Explore the effect of synthetic audio as a data-augmentation method in speech emotion recognition and keyword recognition tasks.

### Main findings

- **Zero-shot audio recognition**: MusicGen performs best in zero-shot music classification, while AUDIOGEN performs better in zero-shot recognition of other audio categories.
- **Prompt diversity**: Prompts generated with the assistance of a large language model (LLM) can significantly improve zero-shot audio recognition performance.
- **Influence of the number of generated samples**: Generating more audio samples improves zero-shot recognition accuracy, but the improvement saturates once the number of samples reaches a certain scale.
- **Mixed training**: When real audio is limited, mixed training significantly improves audio recognition performance; when real audio is sufficient, simple mixing does not always improve performance.
- **Data augmentation**: Synthetic audio performs well as a data-augmentation method in speech-related tasks, especially in improving the robustness of the model to environmental noise.

In conclusion, through systematic study, this paper shows that synthetic audio has potential value in audio recognition and speech modeling, especially when data is limited or model robustness needs to be enhanced.
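The zero-shot and mixed-training settings summarized above differ mainly in how the training set is assembled. The following is a minimal sketch of that assembly step, not the authors' implementation (their code is in the linked repository). The function name `build_training_set`, the `(clip_id, label)` tuple layout, and the `real_fraction` parameter are all hypothetical names introduced for illustration. It assumes evaluation is always done on held-out real audio.

```python
import random
from collections import defaultdict


def build_training_set(real, synthetic, real_fraction, seed=0):
    """Combine a per-class subsample of real clips with all synthetic clips.

    real, synthetic: lists of (clip_id, label) pairs (hypothetical layout).
    real_fraction:   share of each class's real clips to keep, in [0, 1].
                     0.0 gives a synthetic-only ("zero-shot") training set;
                     small values mimic the limited-real-audio setting.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")

    rng = random.Random(seed)

    # Group real clips by label so every class is subsampled equally.
    by_label = defaultdict(list)
    for clip_id, label in real:
        by_label[label].append(clip_id)

    mixed = []
    for label in sorted(by_label):
        clips = by_label[label]
        keep = round(real_fraction * len(clips))
        for clip_id in rng.sample(clips, keep):
            mixed.append((clip_id, label, "real"))

    # All synthetic clips are added, tagged with their source.
    mixed.extend((clip_id, label, "synthetic") for clip_id, label in synthetic)
    rng.shuffle(mixed)
    return mixed


# Toy usage with made-up clip identifiers and labels.
labels = ("dog_bark", "siren")
real = [(f"real_{lab}_{i}", lab) for lab in labels for i in range(4)]
synthetic = [(f"syn_{lab}_{i}", lab) for lab in labels for i in range(3)]

zero_shot = build_training_set(real, synthetic, real_fraction=0.0)
low_resource = build_training_set(real, synthetic, real_fraction=0.5)
```

Keeping the source tag on every item makes it easy to check how much of a training set is real versus synthetic, which is the quantity the mixed-training finding depends on.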

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Audio Explanation Synthesis with Generative Foundation Models

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

GANSynth: Adversarial Neural Audio Synthesis

Synthetic training set generation using text-to-audio models for environmental sound classification

Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

Contrastive Learning from Synthetic Audio Doppelgangers

A Framework for Synthetic Audio Conversations Generation using Large Language Models

Generative Deep Learning and Signal Processing for Data Augmentation of Cardiac Auscultation Signals: Improving Model Robustness Using Synthetic Audio

SONAR: A Synthetic AI-Audio Detection Framework and Benchmark

Computer Audition: From Task-Specific Machine Learning to Foundation Models

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

High Fidelity Speech Synthesis with Adversarial Networks

AudioSR: Versatile Audio Super-resolution at Scale

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

Sparks of Large Audio Models: A Survey and Outlook

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Improved Techniques for the Conditional Generative Augmentation of Clinical Audio Data

MIMII-Gen: Generative Modeling Approach for Simulated Evaluation of Anomalous Sound Detection System