EcoVal: An Efficient Data Valuation Framework for Machine Learning

Ayush K Tarun, Vikram S Chundawat, Murari Mandal, Hong Ming Tan, Bowei Chen, Mohan Kankanhalli
2024-07-09
Abstract: Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive, as they require a considerable amount of repeated model training to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework, EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of working directly with individual data samples, we determine the value of a cluster of similar data points. This value is then propagated among all the member points of the cluster. We show that the overall value of the data can be determined by estimating the intrinsic and extrinsic value of each data point. This is enabled by formulating the performance of a model as a *production function*, a concept popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models. The code is available at https://github.com/respai-lab/ecoval.
Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is **how to efficiently evaluate the value of data in machine learning**. Existing data valuation frameworks based on Shapley values are computationally expensive because they require a large number of repeated model trainings to obtain the Shapley value of each data point. This consumes a great deal of time and computational resources, and it also increases the carbon footprint. To solve this problem, the authors propose an efficient data valuation framework named **EcoVal**. The main contributions of EcoVal are:

1. **Novel framework**: EcoVal clusters similar data points and estimates the marginal contribution of the entire cluster instead of handling individual data samples directly. This significantly reduces the number of items that need to be valued, thereby improving computational efficiency.
2. **Application of a production function**: The authors borrow the concept of a production function from economics to represent the relationship between data and model performance. With it, they estimate the intrinsic and extrinsic value of each data point and from these determine the value of individual data points.
3. **Computational efficiency**: EcoVal only needs to examine the marginal contribution of representative data points, avoiding the overhead of creating multiple subsets that contain similar data points. This lets EcoVal scale to large data sets without being hampered by the presence of similar data points in the data set.
4. **Theoretical proof**: The authors provide a theoretical proof of their data valuation method and show that it has a very small error compared with the traditional Shapley value approximation method.
5. **Empirical evaluation**: Through experiments on data sets such as MNIST, CIFAR10, and CIFAR100, the authors compare EcoVal with existing state-of-the-art data valuation methods (such as Data Shapley, LOO error, and Distributional Shapley). The results show that EcoVal significantly accelerates the data valuation process while maintaining or improving performance.

### Formula summary

- **Leave-One-Out (LOO) error**:
  \[ \Phi_{\text{LOO}}(z; U, B) = U(B) - U(B \setminus \{z\}) \]
- **Shapley value**:
  \[ \Phi_s(z; U, B) = \frac{1}{m} \sum_{k=1}^{m} \binom{m-1}{k-1}^{-1} \sum_{S \subseteq B \setminus \{z\},\, |S| = k-1} \Delta(z; U, S) \]
  where \(m = |B|\) and \(\Delta(z; U, S) = U(S \cup \{z\}) - U(S)\).
- **Marginal contribution at the cluster level**:
  \[ V_c = U(B) - U(B \setminus c) \]
- **Production function form**:
  \[ U_T(S, N) = A f(S) h_T(N) \]
  where \(f(S)\) represents the information utility of the data set \(S\) for the model prediction efficiency, and \(h_T(N)\) represents the influence of the model capacity on the task \(T\).
- **Adjusted data point value**:
  \[ V_i^* = \alpha_i \beta_i^* \]
  where \(\alpha_i\) is the intrinsic value of the data point, and \(\beta_i^*\) is the extrinsic factor capturing the interaction between the data point and the remaining data points.

Through these improvements, EcoVal provides a more efficient and practical data valuation method that is suitable for valuing data for large-scale machine learning models.
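To make the first three formulas in the summary concrete, the following is a minimal Python sketch that evaluates the LOO error, the exact Shapley value, and the cluster-level marginal contribution on a toy problem. It is not taken from the EcoVal code base: the names `POINTS`, `utility`, `loo_value`, `shapley_value`, and `cluster_value` are hypothetical, and the utility (the fraction of three "information types" covered by a subset) is an assumption standing in for the performance of a trained model. The exact Shapley computation enumerates every subset, so it is only feasible for tiny sets, which is the cost that motivates cheaper schemes.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Toy data (assumption): each point carries one "information type".
# Points "a" and "b" are near-duplicates, i.e. they would share a cluster.
POINTS = {"a": 0, "b": 0, "c": 1, "d": 2}
NUM_TYPES = 3


def utility(subset):
    """Stand-in for U(S): fraction of information types covered by S."""
    covered = {POINTS[p] for p in subset}
    return Fraction(len(covered), NUM_TYPES)


def loo_value(z, utility, B):
    """Leave-one-out error: U(B) - U(B \\ {z})."""
    B = frozenset(B)
    return utility(B) - utility(B - {z})


def shapley_value(z, utility, B):
    """Exact Shapley value of z, following the formula term by term."""
    B = frozenset(B)
    rest = sorted(B - {z})
    m = len(B)
    total = Fraction(0)
    for k in range(1, m + 1):
        # Inverse binomial weight for subsets of size k - 1.
        weight = Fraction(1, comb(m - 1, k - 1))
        for S in combinations(rest, k - 1):
            S = frozenset(S)
            # Marginal contribution Delta(z; U, S) = U(S + z) - U(S).
            total += weight * (utility(S | {z}) - utility(S))
    return total / m


def cluster_value(cluster, utility, B):
    """Cluster-level marginal contribution: V_c = U(B) - U(B \\ c)."""
    B = frozenset(B)
    return utility(B) - utility(B - frozenset(cluster))


for z in sorted(POINTS):
    print(z, "LOO:", loo_value(z, utility, POINTS),
          "Shapley:", shapley_value(z, utility, POINTS))
print("cluster {a, b}:", cluster_value({"a", "b"}, utility, POINTS))
```

In this toy setting the LOO error of each duplicate (`a` or `b`) is 0, because removing one still leaves the other, while their Shapley values are 1/6 each and the cluster `{a, b}` as a whole has a marginal contribution of 1/3. This illustrates why valuing a cluster of similar points can capture information that per-point LOO misses, at the cost of one subset evaluation rather than the exponential enumeration needed for exact Shapley values.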