Abstract: The introduction of audio latent diffusion models possessing the ability to generate realistic sound clips on demand from a text description has the potential to revolutionize how we work with audio. In this work, we make an initial attempt at understanding the inner workings of audio latent diffusion models by investigating how their audio outputs compare with the training data, similar to how a doctor auscultates a patient by listening to the sounds of their organs. Using text-to-audio latent diffusion models trained on the AudioCaps dataset, we systematically analyze memorization behavior as a function of training set size. We also evaluate different retrieval metrics for evidence of training data memorization, finding the similarity between mel spectrograms to be more robust in detecting matches than learned embedding vectors. In the process of analyzing memorization in audio latent diffusion models, we also discover a large number of duplicated audio clips within the AudioCaps database.

What problem does this paper attempt to address?

The problem this paper attempts to address is whether audio latent diffusion models replicate or memorize training data when generating audio. Specifically, the authors explore the potential memorization behavior of these models by analyzing the similarity between generated audio and training data. The study aims to understand the internal working mechanisms of these models and to evaluate how effectively different signal representations detect training data replication.

### Main Research Content:

1. **Defining Replication**: The authors define a generated audio file as replicating training data if it contains almost identical complex spectro-temporal patterns.
2. **Experimental Method**: Experiments were conducted using TANGO, a text-to-audio generation model, with audio samples generated from models trained on training sets of different sizes (1000, 5000, and the full AudioCaps dataset). Mel spectrograms and Contrastive Language-Audio Pretraining (CLAP) descriptors were used to detect similarities between generated samples and training data.
3. **Results Analysis**: By comparing the performance of the different descriptors, the study evaluates whether the generated samples replicate training data and explores the reasons behind the replication phenomenon.

### Main Findings:

- On small training sets, the model exhibits significant overfitting and replication behavior, with generated audio being highly similar to the training data.
- On large training sets, replication still occurs, but it is less frequent and mainly concentrated on certain complex spectro-temporal patterns.
- Mel spectrograms perform better than CLAP descriptors in detecting replication, accurately identifying similarities between generated samples and training data.
- A large number of duplicate audio segments were found in the AudioCaps dataset, which may affect the model's training effectiveness.

### Research Significance:

This study helps to understand, from a technical perspective, the potential memorization and replication behavior of generative models during training, providing scientific evidence for addressing related legal and ethical issues. Additionally, the research results offer references for further optimization of audio generative models.
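
The retrieval step described under Experimental Method, matching each generated sample against the training data by descriptor similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not state which similarity measure or decision rule the paper uses, so cosine similarity, the `threshold` value, and the function names `cosine_similarity`, `top_match`, and `flag_replications` are all assumptions made for illustration. The descriptors here are toy vectors standing in for flattened mel spectrograms or CLAP embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero descriptor as matching nothing
    return dot / (norm_a * norm_b)


def top_match(
    generated: Sequence[float], training: List[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index, similarity) of the training descriptor closest to `generated`."""
    if not training:
        raise ValueError("training set is empty")
    scores = [cosine_similarity(generated, t) for t in training]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def flag_replications(
    generated_set: List[Sequence[float]],
    training: List[Sequence[float]],
    threshold: float = 0.95,  # assumed cut-off, not a value from the paper
) -> List[Tuple[int, int, float]]:
    """List (generated_idx, training_idx, similarity) for top matches at or above threshold."""
    flagged = []
    for g_idx, g in enumerate(generated_set):
        t_idx, score = top_match(g, training)
        if score >= threshold:
            flagged.append((g_idx, t_idx, score))
    return flagged


if __name__ == "__main__":
    # Toy descriptors with made-up numbers, standing in for real audio features.
    training = [[1.0, 0.0, 0.0, 2.0], [0.0, 3.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
    generated = [[0.0, 3.1, 0.9, 0.0], [2.0, 0.0, 1.0, 0.0]]
    for g_idx, t_idx, score in flag_replications(generated, training):
        print(f"generated {g_idx} ~ training {t_idx} (cosine {score:.3f})")
```

In practice the descriptor choice matters more than the retrieval loop: the same top-match search can be run once with mel-spectrogram features and once with learned embeddings, and the resulting matches compared by listening, which is the kind of comparison the summary reports.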

Generation or Replication: Auscultating Audio Latent Diffusion Models

ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Text-Driven Foley Sound Generation With Latent Diffusion Model

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

SoundLoCD: An Efficient Conditional Discrete Contrastive Latent Diffusion Model for Text-to-Sound Generation

Long-form music generation with latent diffusion

Noise2Music: Text-conditioned Music Generation with Diffusion Models

Multi-Source Music Generation with Latent Diffusion

Retrieval-Augmented Text-to-Audio Generation

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

MIMII-Gen: Generative Modeling Approach for Simulated Evaluation of Anomalous Sound Detection System

AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

Diffsound: Discrete Diffusion Model for Text-to-Sound Generation

On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models

Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer