CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning

Huaiguang Cai
2024-06-18
Abstract: Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model accuracy. However, the resource-intensive and time-consuming nature of multiple model retraining poses challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (Conduct of Hardness and Gradient) score, which approximates the utility of each data subset on model accuracy during a single model training. By deriving the closed-form expression of the Shapley value for each data point under the CHG score utility function, we reduce the computational complexity to the equivalent of a single model retraining, an exponential improvement over existing methods. Additionally, we employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data. CHG Shapley facilitates trustworthy model training through efficient data valuation, introducing a novel data-centric perspective on trustworthy machine learning.
Computer Science and Game Theory, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to perform data valuation efficiently on large-scale datasets in order to promote trustworthy machine learning. It proposes the CHG (Conduct of Hardness and Gradient) score, which evaluates the impact of data subsets on model accuracy. By deriving the analytical expression of the Shapley value of each data point under the CHG score utility function, the paper reduces the computational cost of valuation to the equivalent of one model retraining. This addresses the main weakness of existing data valuation methods such as Data Shapley, which are too resource-intensive and time-consuming for large-scale datasets.

### Main Contributions:

1. **Efficient Data Valuation Method for Large-scale Datasets**: The CHG score is introduced to evaluate the impact of data subsets on model accuracy. Because the Shapley value under this utility has an analytical expression, valuing the entire training dataset takes at most twice the normal training time, which is equivalent to one additional model retraining.
2. **Real-time Training Data Selection for Large-scale Datasets Based on Data Valuation**: CHG Shapley is used for real-time data selection, and experiments show that it is effective at identifying high-value and noisy data.
3. **A New Data Perspective for Trustworthy Machine Learning**: CHG Shapley is a parameter-free method whose advantages come entirely from a deep understanding of the data. The researchers believe that data-valuation-based methods have the potential to improve the understanding of model mechanisms and thereby promote the development of trustworthy machine learning.
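For reference, the quantity whose closed form the paper derives is the Shapley value of each data point. The standard definition is given below, with notation chosen here for illustration rather than taken from the paper: $N$ is the training set with $n$ points, $U(S)$ is the utility of a subset $S$ (for CHG Shapley, the CHG score), and $\phi_i$ is the value assigned to point $i$.

```latex
\phi_i(U) \;=\; \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
  \binom{n-1}{|S|}^{-1} \bigl[\, U(S \cup \{i\}) - U(S) \,\bigr]
```

The sum ranges over all $2^{n-1}$ subsets that exclude point $i$, which is why evaluating it directly is infeasible when each $U(S)$ requires training a model, and why a utility that admits a closed-form solution matters.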
### Background and Related Work:

- **Data-Centric AI**: In recent years, a growing number of researchers have approached trustworthy machine learning from the data perspective, because data quality directly affects model performance.
- **Shapley Value**: The Shapley value fairly distributes the total payoff of a cooperative game among its participants according to their contributions. Its exact computation has complexity O(2^n), so approximation methods are needed to make it practical.
- **Data Valuation**: Data valuation aims to quantify the impact of training data on the performance of machine learning models, especially deep neural networks. Existing methods such as Data Shapley, KNN Shapley, and Beta Shapley are effective but remain inefficient on large-scale datasets.

### Method Details:

- **CHG Score**: The CHG score combines the hardness and gradient information of data points to evaluate the impact of data subsets on model accuracy. Its analytical form allows the Shapley value of each data point to be calculated efficiently.
- **CHG Shapley Algorithm**: Based on the CHG score, the analytical expression of the Shapley value of each data point is derived, which yields efficient data valuation.

### Experimental Results:

- **Data Pruning Settings**: CHG Shapley performs well in data pruning tasks. It identifies high-value data effectively, especially when the selection ratio is small.
- **Noise Label Detection**: When 30% of class labels are randomly replaced, CHG Shapley outperforms other methods and effectively detects and handles label noise.

In conclusion, by proposing the CHG Shapley method, this paper improves the efficiency of data valuation and provides a new data perspective for trustworthy machine learning.
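The exponential cost of exact Shapley computation mentioned in the background can be made concrete with a small sketch. The code below is not the CHG Shapley algorithm and does not use the CHG score, whose formula is not given in this summary. It is a brute-force evaluation of the standard Shapley definition on a hypothetical toy utility (`toy_utility`, with made-up weights) that stands in for "model accuracy on a subset". The names `exact_shapley` and `toy_utility` are assumptions introduced for illustration.

```python
from itertools import combinations
from math import comb


def exact_shapley(n, utility):
    """Brute-force Shapley values for points 0..n-1.

    `utility` maps a frozenset of point indices to a float.
    The loops visit every subset of the other n-1 points for each point,
    so the number of utility calls grows as n * 2^(n-1).
    """
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the subset S that excludes i
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of point i when added to S.
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def toy_utility(subset):
    """Hypothetical utility: additive per-point weights plus one interaction.

    Point 3 has a negative weight, loosely mimicking a noisy datum.
    Points 0 and 1 add a bonus of 0.2 only when both are present.
    """
    weights = [0.5, 0.3, 0.2, -0.1]
    total = sum(weights[j] for j in subset)
    if 0 in subset and 1 in subset:
        total += 0.2
    return total


values = exact_shapley(4, toy_utility)
for index, value in enumerate(values):
    print(f"point {index}: {value:+.3f}")
```

In this toy game the interaction bonus is split evenly between points 0 and 1, and the values sum to the utility of the full set. With a real utility, every call to `utility` would mean training a model, which is the cost that approximation methods and closed-form approaches such as CHG Shapley are designed to avoid.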