Abstract: This work aims to address an open problem in data valuation literature concerning the efficient computation of Data Shapley for weighted $K$ nearest neighbor algorithm (WKNN-Shapley). By considering the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley into a counting problem and introduce a quadratic-time algorithm, presenting a notable improvement from $O(N^K)$, the best result from existing literature. We develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.

What problem does this paper attempt to address?

This paper addresses an open problem in the data valuation literature: how to efficiently compute the Data Shapley value for the weighted K-nearest neighbor algorithm (WKNN-Shapley). For the hard-label KNN classifier with discretized weights, the authors reformulate the computation of WKNN-Shapley as a counting problem and introduce a quadratic-time algorithm, a significant improvement over $O(N^K)$, the best result in the existing literature. They also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value.

### Specific Problem Description

1. **Background and Motivation**:
   - Data is at the core of machine-learning models, but not all data are of equal quality. In real-world scenarios, data is often noisy and biased, comes from diverse sources, and goes through different labeling processes.
   - Data valuation is an emerging research area that aims to quantify the quality of each data source used for machine-learning training. It is used to diagnose influential training instances in explainable machine learning and to determine fair compensation in data markets.
2. **Existing Challenges**:
   - Although previous research has shown that unweighted KNN-Shapley can be computed efficiently, no practical and efficient algorithm has been developed for the more general weighted KNN-Shapley (WKNN-Shapley).
   - The best existing polynomial-time algorithm runs in $O(N^K)$ time, which becomes impractical even for a small K (such as 5).
3. **Research Objectives**:
   - Propose a method for efficiently computing WKNN-Shapley to close the efficiency gap left by existing methods.
   - Verify through experiments the computational efficiency of the proposed method and its superior performance in discerning data quality.
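To see why a 0/1 accuracy utility turns the computation into a counting problem, it may help to write out the standard Shapley value. The notation here ($D$ for the training set of size $N$, $z_i$ for a training point, $v$ for the utility, and the counts $c_k^{+}$, $c_k^{-}$) is assumed for illustration and is not taken from the paper.

$$
\phi_i(v) = \frac{1}{N} \sum_{k=0}^{N-1} \binom{N-1}{k}^{-1} \sum_{\substack{S \subseteq D \setminus \{z_i\} \\ |S| = k}} \bigl[ v(S \cup \{z_i\}) - v(S) \bigr]
$$

If $v$ is the hard-label accuracy on a single test point, then $v(S) \in \{0, 1\}$ and every marginal contribution lies in $\{-1, 0, 1\}$, so the inner sum reduces to a difference of two counts:

$$
\phi_i(v) = \frac{1}{N} \sum_{k=0}^{N-1} \binom{N-1}{k}^{-1} \bigl( c_k^{+} - c_k^{-} \bigr)
$$

Here $c_k^{+}$ is the number of size-$k$ subsets for which adding $z_i$ changes the prediction from wrong to correct, and $c_k^{-}$ is the number for which it changes the prediction from correct to wrong. This is only a generic identity; the counting formulation used in the paper may be organized differently, and obtaining the counts efficiently is the hard part.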
### Main Contributions

- **Adjust KNN Configuration to Adapt to Shapley Value Calculation**: By making necessary modifications to a specific KNN classifier configuration, the focus is shifted to the hard-label KNN classifier with discretized weights.
- **Exact Calculation Algorithm with Quadratic Time Complexity**: Based on the adjusted "Shapley-friendly" configuration, the Shapley value calculation is reformulated as a counting problem, and a quadratic-time algorithm is developed to solve this counting problem.
- **Deterministic Approximation Algorithm with Sub-quadratic Time Complexity**: By fine-tuning the exact WKNN-Shapley implementation, a deterministic approximation algorithm is proposed that further improves computational efficiency while retaining the key fairness properties of the original Shapley value.
- **Empirical Evaluation**: Experiments are carried out on benchmark datasets to evaluate the efficiency and effectiveness of the proposed exact and approximate algorithms.

Through these improvements, the authors show that WKNN-Shapley can be efficiently calculated and approximated, which promotes its wider application and provides a more effective data valuation method than unweighted KNN-Shapley.
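For intuition about the quantity being computed, the following is a minimal brute-force sketch of Data Shapley under a hard-label weighted KNN utility with discretized weights. It is not the paper's quadratic-time algorithm: it enumerates every subset and is only feasible for a handful of points. The function names, the 1-D distance, the `exp(-distance)` weight, the integer weight grid, the tie-breaking rule, and the convention that the empty set has utility 0 are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb, exp


def discretize(weight, bits=3):
    """Map a weight in [0, 1] to an integer level on a grid of 2**bits levels.

    Integer levels keep vote totals exact (hypothetical discretization scheme).
    """
    top = 2 ** bits - 1
    return round(weight * top)


def utility(subset, X, y, x_test, y_test, k, bits=3):
    """0/1 accuracy of hard-label weighted KNN on a single test point."""
    if not subset:
        return 0.0  # assumed convention: an empty training set scores 0
    # Keep the k points of the subset closest to the test point.
    nearest = sorted(subset, key=lambda i: abs(X[i] - x_test))[:k]
    votes = {}
    for i in nearest:
        level = discretize(exp(-abs(X[i] - x_test)), bits)
        votes[y[i]] = votes.get(y[i], 0) + level
    # Deterministic tie-break (assumed): the smallest label wins a tie.
    pred = max(sorted(votes), key=lambda c: votes[c])
    return 1.0 if pred == y_test else 0.0


def brute_force_shapley(X, y, x_test, y_test, k, bits=3):
    """Exact Shapley values by enumerating all subsets (exponential in len(X))."""
    n = len(X)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            coeff = 1.0 / (n * comb(n - 1, size))
            for S in combinations(others, size):
                gain = (utility(S + (i,), X, y, x_test, y_test, k, bits)
                        - utility(S, X, y, x_test, y_test, k, bits))
                values[i] += coeff * gain
    return values


if __name__ == "__main__":
    X = [0.1, 0.4, 0.5, 0.9, 1.5]   # toy 1-D training features
    y = [1, 0, 1, 0, 1]             # toy labels
    print(brute_force_shapley(X, y, x_test=0.0, y_test=1, k=3))
```

By the efficiency property of the Shapley value, the returned values sum to the utility of the full training set minus the utility of the empty set, which gives a quick sanity check on any faster implementation.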

Efficient Data Shapley for Weighted Nearest Neighbor Algorithms
