Abstract: Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training's underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data's influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound. A curated, up-to-date list of resources related to influence analysis is available at https://github.com/ZaydH/influence_analysis_papers.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to provide a comprehensive review of training data influence analysis and its estimation methods. Specifically:

1. **Background and Motivation**:
   - Modern machine learning models depend increasingly on their training data, while their internal mechanisms are becoming more opaque.
   - Anomalous training instances, whether they arise from natural causes, measurement errors, or human annotation errors, can degrade the model's overall generalization performance.
   - Large-scale datasets often contain a significant number of anomalous instances from a variety of potential sources.
   - Biases may be introduced during model training and cause harm in practical applications.
2. **Core Issues**:
   - Partially reveal the relationship between training data and model predictions by quantifying the impact of each training instance on the final model.
   - Compare the assumptions, complexities, and pros and cons of existing influence analysis methods.
   - Explore how to make influence analysis more useful in practice and more sound both theoretically and empirically.
3. **Research Contributions**:
   - Provide the first comprehensive survey of existing training data influence analysis techniques.
   - Compare the definitions, assumptions, complexities, and pros and cons of different influence analysis methods.
   - Propose future research directions to improve the practicality and theoretical foundation of influence analysis.
4. **Main Method Classifications**:
   - **Point-to-Point Influence Analysis**: Quantify the impact of a single training instance on a single test instance.
   - **Retraining-Based Methods**: Measure influence by repeatedly retraining the model.
   - **Gradient-Based Methods**: Estimate influence through the alignment of training instance gradients.
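To make the retraining-based and gradient-based categories concrete, the following is a minimal sketch rather than any specific estimator from the survey. It assumes a toy ridge regression model fit in closed form with NumPy, a squared loss, and a synthetic dataset. The function names (`fit_ridge`, `retraining_influence`, `gradient_alignment`) and the regularization value `lam` are hypothetical choices made for illustration. The retraining-based score is a leave-one-out change in test loss. The gradient-based score is a plain dot product between a training gradient and the test gradient at the final parameters, which is a simplification of what practical estimators compute.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression; stands in for 'training the model'."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def squared_loss(theta, x, y):
    """Squared loss of a single instance (x, y) under parameters theta."""
    return 0.5 * (x @ theta - y) ** 2

def loss_grad(theta, x, y):
    """Gradient of the squared loss with respect to theta."""
    return (x @ theta - y) * x

def retraining_influence(X, y, x_test, y_test, lam=1e-2):
    """Retraining-based: drop instance i, retrain, record the change in test loss.

    A positive score means the test loss rises when instance i is removed.
    """
    base = squared_loss(fit_ridge(X, y, lam), x_test, y_test)
    scores = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        theta_i = fit_ridge(X[keep], y[keep], lam)
        scores[i] = squared_loss(theta_i, x_test, y_test) - base
    return scores

def gradient_alignment(X, y, x_test, y_test, lam=1e-2):
    """Gradient-based: dot product of each training gradient with the test gradient.

    Evaluated only at the final parameters (an assumption of this sketch).
    """
    theta = fit_ridge(X, y, lam)
    g_test = loss_grad(theta, x_test, y_test)
    return np.array([loss_grad(theta, X[i], y[i]) @ g_test for i in range(len(X))])

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# The test point duplicates training instance 0, so that instance should
# register as helpful under both scores.
x_test, y_test = X[0].copy(), y[0]

loo = retraining_influence(X, y, x_test, y_test)
align = gradient_alignment(X, y, x_test, y_test)
```

The retraining loop fits the model once per training instance, which illustrates why retraining-based approaches become expensive on large datasets, while the gradient-based score needs only one fitted model plus per-instance gradients.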
In summary, by systematically reviewing and comparing influence analysis methods, this paper aims to give researchers and practitioners a sound basis for understanding these methods and for choosing the one best suited to a specific application.

Training Data Influence Analysis and Estimation: A Survey

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

Capturing the Temporal Dependence of Training Data Influence

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

A Bayesian Approach To Analysing Training Data Attribution In Deep Learning

If Influence Functions are the Answer, Then What is the Question?

Influence Functions in Deep Learning Are Fragile

Empirical influence functions to understand the logic of fine-tuning

Survey of social influence analysis and modeling

Position: Insights from Survey Methodology can Improve Training Data

A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

Channel-wise Influence: Estimating Data Influence for Multivariate Time Series

Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

Mining Influential Training Data by Tracing Influence on Hard Validation Samples

Disentangling Influence: Using Disentangled Representations to Audit Model Predictions

DIVINE: Diverse Influential Training Points for Data Visualization and Model Refinement

Estimating individual treatment effect: generalization bounds and algorithms

Influence Functions for Scalable Data Attribution in Diffusion Models

Finding Key Training Data by Calculating Influence Score.