Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

Tiantian Feng, Dimitrios Dimitriadis, Shrikanth Narayanan
2024-08-29
Abstract: Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using them as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at https://github.com/usc-sail/SynthAudio.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper asks **whether synthetic audio can assist audio recognition and speech modeling**. Specifically, the authors explore two main issues:

1. **Quality assessment of synthetic audio**:
   - Traditional quality assessment for audio generation relies on distance metrics such as Frechet Audio Distance (FAD), but such metrics cannot fully reflect how well the generated audio matches real audio.
   - The paper proposes an alternative: assess the quality of synthetic audio by examining how effective it is as training data.
2. **Application of synthetic audio in audio recognition and speech modeling**:
   - Investigate whether synthetic audio can serve as a data augmentation resource that improves audio recognition and speech modeling performance.
   - Explore how synthetic audio performs in zero-shot learning, and how the number of generations and prompt diversity affect zero-shot audio recognition.

### Specific research content

- **Selection of audio generation models**: The paper uses three popular audio generation models, AUDIOGEN, AudioLDM 2, and MusicGen, which are used to generate general sounds, human action sounds, and music respectively.
- **Experimental design**:
  - **Zero-shot audio recognition**: Compare how synthetic audio from the different generation models performs in zero-shot learning.
  - **Mixed training**: Mix synthetic audio with real audio for training and study the impact on audio recognition performance.
  - **Data augmentation**: Explore the effect of synthetic audio as a data augmentation method in speech emotion recognition and keyword recognition tasks.

### Main findings

- **Zero-shot audio recognition**: MusicGen performs best in zero-shot music classification, while AUDIOGEN performs better in zero-shot recognition of other audio categories.
- **Prompt diversity**: Prompts generated with the assistance of a large language model (LLM) can significantly improve zero-shot audio recognition performance.
- **Influence of the number of generations**: Generating more audio samples improves zero-shot recognition accuracy, but the gains saturate once the number of generations reaches a certain scale.
- **Mixed training**: When real audio is limited, mixed training significantly improves audio recognition performance; when real audio is sufficient, simple mixing does not always improve performance.
- **Data augmentation**: Synthetic audio performs well as a data augmentation method in speech-related tasks, especially in improving model robustness to environmental noise.

In conclusion, this systematic study shows that synthetic audio has potential value for audio recognition and speech modeling, especially when data is limited or model robustness needs to be improved.
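To make the mixed-training setup summarized above more concrete, here is a minimal sketch of one way real and synthetic clips could be combined into a single training list. It is an illustration only, not the authors' implementation: the function name `build_mixed_training_set`, the `synthetic_per_real` parameter, the `(path, label)` tuple layout, and the per-label sampling rule are all assumptions made for this example.

```python
import random
from collections import defaultdict


def build_mixed_training_set(real, synthetic, synthetic_per_real=1.0, seed=0):
    """Combine real and synthetic (path, label) pairs into one training list.

    Hypothetical helper, not taken from the paper's code. For each label,
    at most round(synthetic_per_real * n_real) synthetic clips are sampled,
    where n_real is the number of real clips with that label. Labels with
    no real clips keep all of their synthetic clips.
    """
    rng = random.Random(seed)
    real_by_label = defaultdict(list)
    synth_by_label = defaultdict(list)
    for path, label in real:
        real_by_label[label].append((path, label, "real"))
    for path, label in synthetic:
        synth_by_label[label].append((path, label, "synthetic"))

    mixed = []
    for label in sorted(set(real_by_label) | set(synth_by_label)):
        real_items = real_by_label.get(label, [])
        synth_items = synth_by_label.get(label, [])
        if real_items:
            # Cap the synthetic clips relative to the real clips for this label.
            budget = min(len(synth_items),
                         round(synthetic_per_real * len(real_items)))
            synth_items = rng.sample(synth_items, budget)
        mixed.extend(real_items)
        mixed.extend(synth_items)

    # Shuffle so real and synthetic clips are interleaved during training.
    rng.shuffle(mixed)
    return mixed
```

Varying `synthetic_per_real` while holding the real set fixed is one simple way to probe the finding that mixing helps most when real audio is scarce. Each returned item carries a `"real"` or `"synthetic"` tag so the two sources can still be separated later, for example to evaluate on real audio only.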