Abstract: Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by developing an *influence functions* framework. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we systematically develop K-FAC approximations based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We recast previously proposed methods as specific design choices in our framework and show that our recommended method outperforms previous data attribution approaches on common evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.

What problem does this paper attempt to address?

This paper addresses two challenges that diffusion models raise: data attribution and interpretability.

1. **Data Attribution**: Given a specific sample generated by a diffusion model, quantify how much each training data point influenced that output. This matters for copyright in particular: knowing which training data had the greatest impact on a generated sample helps identify, and potentially remove, the data points that lead to undesired outputs.
2. **Interpretability**: When a model's output is unsatisfactory, identifying the training data that most influenced it improves the transparency and controllability of the model.

To meet these challenges, the authors develop a framework based on influence functions that predicts how the model's output would change if certain training data were removed. For diffusion models, they focus on predicting the change in the probability of generating a specific sample, which they estimate through several proxy measurements.

### Main Methods

- **Influence Function Framework**: Influence functions approximately answer the question "What would the output be if the model had been trained with certain data excluded?" This makes it possible to find the training data points that contribute most to a low loss or a high generation probability for a given sample.
- **K-FAC Approximation**: To keep the Hessian computations scalable, the authors systematically develop Kronecker-Factored Approximate Curvature (K-FAC) approximations based on the Generalized Gauss-Newton (GGN) matrix, tailored specifically to diffusion models.
- **Design Space Unification**: The authors reinterpret previously proposed methods as specific design choices within their framework and show that the recommended method outperforms existing methods on common evaluations, such as the Linear Data-modelling Score (LDS) and retraining without the top influences, without method-specific hyperparameter tuning.

### Key Contributions

- A scalable influence function approximation suited to data attribution in diffusion models.
- A unification and improvement of previous work, providing more effective tools for understanding and analyzing the behavior of diffusion models.
- Empirical studies showing that the proposed method outperforms existing data attribution methods on LDS and on retraining without the top influences.

With these methods, the authors aim to clarify how training data affects generated samples in diffusion models, improving the transparency and controllability of the model, especially in applications where copyright is a concern.
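To make the influence-function and K-FAC ideas in the summary above more concrete, here is a minimal NumPy sketch for a single linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kfac_factors`, `kfac_damped_solve`, `influence_scores`), the damping value, and the random toy data are all hypothetical. It assumes the layer's curvature is approximated by the Kronecker product of an input second-moment matrix `A` and an output-gradient second-moment matrix `S`, and it scores each training example by the inner product between its gradient and the curvature-preconditioned gradient of a query measurement. Sign and scaling conventions for influence scores vary between papers, so the scores here should be read only up to such a convention.

```python
import numpy as np

def kfac_factors(X, G):
    """Kronecker factors for one linear layer (hypothetical helper).

    X: (N, d_in) layer inputs; G: (N, d_out) gradients w.r.t. layer outputs.
    """
    n = X.shape[0]
    A = X.T @ X / n  # input second moment, shape (d_in, d_in)
    S = G.T @ G / n  # output-gradient second moment, shape (d_out, d_out)
    return A, S

def kfac_damped_solve(A, S, V, damping=1e-3):
    """Apply (kron(S, A) + damping * I)^{-1} to a weight-shaped array V (d_out, d_in).

    Uses eigendecompositions of the two small factors, so the full
    (d_out * d_in) x (d_out * d_in) matrix is never formed.
    """
    lam_a, Q_a = np.linalg.eigh(A)
    lam_s, Q_s = np.linalg.eigh(S)
    V_rot = Q_s.T @ V @ Q_a                        # rotate into the joint eigenbasis
    V_rot = V_rot / (np.outer(lam_s, lam_a) + damping)  # divide by damped eigenvalues
    return Q_s @ V_rot @ Q_a.T                     # rotate back

def influence_scores(X, G, x_query, g_query, damping=1e-3):
    """Score every training example against one query measurement.

    The per-example weight gradient of a linear layer is outer(G[n], X[n]);
    the query gradient is outer(g_query, x_query).
    """
    A, S = kfac_factors(X, G)
    P = kfac_damped_solve(A, S, np.outer(g_query, x_query), damping)
    # <outer(G[n], X[n]), P> for every n, without materialising the outer products
    return np.einsum("no,oi,ni->n", G, P, X)

# Toy data standing in for layer inputs and backpropagated output gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
G = rng.normal(size=(32, 3))
scores = influence_scores(X, G, rng.normal(size=4), rng.normal(size=3), damping=0.1)
top_influences = np.argsort(-np.abs(scores))[:5]  # indices of largest-magnitude scores
```

The eigendecomposition route is chosen here because it handles the damping term exactly for the Kronecker-factored matrix, at the cost of decomposing only the two small factors. In a real diffusion model the same computation would be repeated per layer, with the gradients taken from the chosen proxy measurement and training loss.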

Influence Functions for Scalable Data Attribution in Diffusion Models

Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Intriguing Properties of Data Attribution on Diffusion Models

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Training Data Attribution for Diffusion Models

A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

DSCom: A Data-Driven Self-Adaptive Community-Based Framework for Influence Maximization in Social Networks

H-Diffu: Hyperbolic Representations for Information Diffusion Prediction

Unveiling Concept Attribution in Diffusion Models

Influence Maximization with Fairness at Scale (Extended Version)

Physics-Informed Diffusion Models

Scalable Continuous-time Diffusion Framework for Network Inference and Influence Estimation

Scalable Influence Estimation Without Sampling

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

Diffusion Model for Data-Driven Black-Box Optimization

Influence Functions in Deep Learning Are Fragile

Influence-based Attributions can be Manipulated

The Emergence of Reproducibility and Generalizability in Diffusion Models

Training Data Attribution via Approximate Unrolled Differentiation