Efficient Data Shapley for Weighted Nearest Neighbor Algorithms

Jiachen T. Wang, Prateek Mittal, Ruoxi Jia
2024-01-20
Abstract: This work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
Data Structures and Algorithms, Machine Learning
What problem does this paper attempt to address?
This paper addresses an open problem in the data valuation literature: how to efficiently compute the Data Shapley value for the weighted $K$-nearest-neighbor algorithm (WKNN-Shapley). Specifically, for the hard-label KNN classifier with discretized weights, the authors reformulate the computation of WKNN-Shapley as a counting problem and introduce a quadratic-time algorithm, a significant improvement over $O(N^K)$, the best result in the existing literature. They also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value.

### Specific Problem Description

1. **Background and Motivation**:
   - Data is at the core of machine-learning models, but not all data is of equal quality. In real-world scenarios, data is often noisy and biased, comes from diverse sources, and goes through different labeling processes.
   - Data valuation, an emerging research area, aims to quantify the quality of each data source used for machine-learning training. It is used to diagnose influential training instances in explainable machine learning and to determine fair compensation in data markets.
2. **Existing Challenges**:
   - Although previous research has shown that unweighted KNN-Shapley can be computed efficiently, no practical and efficient algorithm has been developed for the more general weighted KNN-Shapley (WKNN-Shapley).
   - The existing polynomial-time algorithm has $O(N^K)$ time complexity, which becomes impractical even for small $K$ (such as 5).
3. **Research Objectives**:
   - Propose a method for efficiently computing WKNN-Shapley that closes the efficiency gap left by existing methods.
   - Verify experimentally the computational efficiency of the proposed method and its superior performance in discerning data quality.
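For reference, the quantity being computed can be written out using the standard Shapley value definition. The notation here is an assumption made for illustration and is not taken from the paper: $D$ is the training set of $N$ points $z_i = (x_i, y_i)$, $v$ is the utility function, $w_j$ is the (discretized) weight of training point $z_j$ with respect to a validation point $(x_{\mathrm{val}}, y_{\mathrm{val}})$, and $\mathrm{top}_K(S)$ denotes the $K$ points of $S$ nearest to $x_{\mathrm{val}}$.

$$
\phi_i(v) = \frac{1}{N} \sum_{k=1}^{N} \binom{N-1}{k-1}^{-1} \sum_{\substack{S \subseteq D \setminus \{z_i\} \\ |S| = k-1}} \bigl[ v(S \cup \{z_i\}) - v(S) \bigr]
$$

$$
v(S) = \mathbb{1}\Bigl[\, y_{\mathrm{val}} = \arg\max_{c} \sum_{z_j \in \mathrm{top}_K(S)} w_j \, \mathbb{1}[y_j = c] \Bigr]
$$

Evaluating the first sum directly takes time exponential in $N$, since it ranges over all subsets of $D \setminus \{z_i\}$. This is why algorithms that exploit the structure of the KNN utility are needed.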
### Main Contributions

- **Adjust KNN Configuration to Adapt to Shapley Value Calculation**: By making the necessary modifications to the KNN classifier configuration, the authors shift the focus to the hard-label KNN classifier with discretized weights.
- **Exact Calculation Algorithm with Quadratic Time Complexity**: Based on this adjusted "Shapley-friendly" configuration, the Shapley value computation is reformulated as a counting problem, and a quadratic-time algorithm is developed to solve it.
- **Deterministic Approximation Algorithm with Sub-quadratic Time Complexity**: By adjusting the exact WKNN-Shapley algorithm, the authors obtain a deterministic approximation algorithm that further improves computational efficiency while retaining the key fairness properties of the original Shapley value.
- **Empirical Evaluation**: Experiments on benchmark datasets evaluate the efficiency and effectiveness of the proposed exact and approximate algorithms.

Through these improvements, the authors show that WKNN-Shapley can be computed and approximated efficiently, which promotes its wider application and provides a more effective data valuation method than unweighted KNN-Shapley.
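To make concrete what the algorithms summarized above compute, the following is a minimal brute-force sketch in Python. It is not the paper's quadratic-time algorithm: it enumerates every subset and is only feasible for a handful of points. Everything specific in it is an assumption chosen for illustration: the function names (`discretize`, `wknn_utility`, `brute_force_shapley`), the one-dimensional features, the weight function $1/(1+d)$, the 3-bit discretization grid, the tie-breaking rule, and the convention that the empty set has utility 0.

```python
from itertools import combinations
from math import comb


def discretize(weight, bits=3):
    """Round a weight in [0, 1] to one of 2**bits evenly spaced levels (assumed scheme)."""
    levels = 2 ** bits - 1
    return round(weight * levels) / levels


def wknn_utility(subset, train, query, k=2):
    """Hard-label weighted KNN utility on a single validation point.

    Returns 1.0 if the weighted vote of the k nearest points in `subset`
    predicts the query label, else 0.0. `subset` holds indices into `train`.
    """
    x_q, y_q = query
    if not subset:
        return 0.0  # assumed convention for the empty training set
    # k nearest neighbors by absolute distance; index breaks distance ties.
    nearest = sorted(subset, key=lambda i: (abs(train[i][0] - x_q), i))[:k]
    votes = {}
    for i in nearest:
        x_i, y_i = train[i]
        # Assumed weight function: closer points get larger weights.
        weight = discretize(1.0 / (1.0 + abs(x_i - x_q)))
        votes[y_i] = votes.get(y_i, 0.0) + weight
    # Vote ties are broken in favor of the smallest label (assumption).
    predicted = max(sorted(votes), key=lambda label: votes[label])
    return 1.0 if predicted == y_q else 0.0


def brute_force_shapley(train, query, k=2):
    """Exact Shapley values by enumerating all subsets (exponential time)."""
    n = len(train)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley coefficient for subsets of this size.
            coef = 1.0 / (n * comb(n - 1, size))
            for s in combinations(others, size):
                gain = (wknn_utility(s + (i,), train, query, k)
                        - wknn_utility(s, train, query, k))
                values[i] += coef * gain
    return values


# Toy data: (feature, label) pairs and one validation point.
train = [(0.1, 1), (0.2, 1), (0.9, 0), (1.5, 0)]
query = (0.0, 1)
values = brute_force_shapley(train, query, k=2)
print(values)
```

In this toy setup the two nearby points with the correct label share the full utility equally, and the two distant points with the wrong label receive zero value. The values sum to the utility of the full training set, as the efficiency property of the Shapley value requires. A brute-force routine like this is mainly useful as a correctness check for faster implementations on very small inputs.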