K-Fold Cross-Valuation for Machine Learning Using Shapley Value

Qiangqiang He, Mujie Zhang, Jie Zhang, Shang Yang, Chongjun Wang
DOI: https://doi.org/10.1007/978-3-031-44213-1_5
2023-01-01
Abstract: Research on data valuation using the Shapley value has recently garnered significant attention. Existing approaches typically estimate the value of the training set by using the model’s performance on a validation set as the utility function. However, since the validation set is often a small subset of the complete dataset, a dataset shift between the training and validation sets may bias the data valuation. To address this issue, this paper proposes a k-fold cross-validation method based on the Shapley value. Specifically, the dataset is divided into k subsets, and each subset is used in turn as the validation set for computing Shapley values of the training set composed of the remaining k - 1 subsets. Each data instance therefore appears in k - 1 training sets, and the average of its k - 1 valuations is taken as its final value. Because the cost of computing the Shapley value grows exponentially with the number of data points, we propose using Monte Carlo permutation sampling, incremental learning, and batch data valuation. These techniques approximate the true Shapley value as closely as possible while reducing computation time. Extensive experiments demonstrate the effectiveness of our method, especially when the validation set contains noise and outliers.
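The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-nearest-neighbour utility, the interleaved fold assignment, the toy data, and all function names (`utility`, `mc_shapley`, `k_fold_cross_valuation`) are assumptions made for illustration, and the incremental-learning and batch-valuation speedups mentioned in the abstract are omitted. Only the fold rotation, the Monte Carlo permutation estimate, and the averaging over k - 1 valuations are shown.

```python
import random


def utility(train, val):
    """Accuracy on `val` of a 1-nearest-neighbour classifier fit on `train`.

    Stand-in model (an assumption). Ties are broken by a fixed key so the
    result depends only on the set of training points, not on their order.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in val:
        nearest = min(train, key=lambda p: (abs(p[0] - x), p[0], p[1]))
        correct += nearest[1] == y
    return correct / len(val)


def mc_shapley(train, val, n_perms, rng):
    """Monte Carlo permutation estimate of each training point's Shapley value."""
    values = [0.0] * len(train)
    for _ in range(n_perms):
        order = list(range(len(train)))
        rng.shuffle(order)
        subset, prev = [], utility([], val)
        for i in order:
            # Marginal contribution of point i when added after its predecessors.
            subset.append(train[i])
            cur = utility(subset, val)
            values[i] += (cur - prev) / n_perms
            prev = cur
    return values


def k_fold_cross_valuation(data, k=3, n_perms=50, seed=0):
    """Average each point's Shapley estimate over the k - 1 folds where it is trained on."""
    rng = random.Random(seed)
    n = len(data)
    folds = [list(range(f, n, k)) for f in range(k)]  # interleaved folds (assumption)
    totals = [0.0] * n
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        for i, v in zip(train_idx, mc_shapley(train, val, n_perms, rng)):
            totals[i] += v
    return [t / (k - 1) for t in totals]


if __name__ == "__main__":
    # Toy 1-D data as (feature, label) pairs; the last point is deliberately mislabeled.
    toy = [(0.0, 0), (0.1, 0), (0.3, 0), (0.9, 1), (1.05, 1), (1.3, 1), (0.2, 1)]
    for point, value in zip(toy, k_fold_cross_valuation(toy)):
        print(point, round(value, 3))
```

Because every point is held out exactly once, each receives k - 1 estimates, which is why the totals are divided by k - 1 rather than k. A real application would replace `utility` with the performance of the model actually being trained.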