Michaela Regneri, Julia S. Georgi, Jurij Kost, Niklas Pietsch, Sabine Stamm
Abstract: We present an approach to compute the monetary value of individual data points in the context of an automated decision system. The proposed method enables us to explore and implement a paradigm of data minimalism for large-scale machine learning systems. Data-minimalistic implementations enhance scalability while maintaining or even optimizing a system's performance. Using two types of recommender systems, we first demonstrate how much data is ineffective in both settings. We then present a general account of computing data value via sensitivity analysis, and how, in theory, individual data points can be priced according to their informational contribution to automated decisions. We further exemplify this method on lab-scale recommender systems and outline further steps towards commercial data-minimalistic applications.
What problem does this paper attempt to address?
The core problem this paper addresses is **how to quantify the value of a single data point and, on that basis, achieve data minimalism**: optimizing the performance of large-scale machine learning systems while reducing the amount of data they use. Specifically, the authors propose a method for calculating the monetary value of individual data points, so that data points that can or should be omitted are identified and data volume is minimized while system performance is maintained or improved.
### Main contributions of the paper:
1. **Quantifying redundant data**: Using two recommender systems as examples, it shows how to quantify redundant data in large-scale data-driven systems.
2. **Calculating the value of data points**: Proposes a method to calculate the value of individual data points in order to identify data points that can or should be omitted, thereby maximizing system performance while minimizing data volume.
3. **Implementing data minimalism**: Lays the foundation for achieving data minimalism as a general digital principle.
### Benefits of data minimalism:
1. **Cost efficiency**: Reducing data usage can significantly reduce computing costs, including direct costs (such as cloud service fees) and indirect costs (such as energy consumption and time).
2. **Social responsibility**: Reducing unnecessary data transmission can reduce security and privacy risks, while reducing energy consumption and carbon dioxide emissions caused by redundant data processing.
3. **Quality improvement**: Data may contain harmful information that degrades system performance. A core purpose of data minimalism is to eliminate such harmful data and improve output quality.
4. **Stability**: Using less data can reduce the need for random sampling, thereby improving system stability, especially when legal requirements need to be met.
### Experimental part:
1. **Co-Occurrence Recommender (COR)**:
- An analysis of how individual user sessions influence product recommendations found that 22.7% of sessions have no influence on the recommendations, so these sessions can be ignored.
- The influence of a session changes over time: for some sessions it increases, for others it decreases.
2. **Vector Recommender (VR)**:
- Uses word2vec to compute product embedding vectors and analyzes how different data volumes affect system performance.
- The results show that once the data volume reaches a certain level, the conversion rate (CR) begins to decline while revenue grows only slowly. More data is therefore not always better; there is an optimal data volume.
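The session-influence analysis described for the co-occurrence recommender can be illustrated with a leave-one-session-out check. The following is a minimal sketch, not the authors' implementation: it assumes that a session has "no influence" when removing it leaves every product's top-k recommendation list unchanged, and the function names `top_k_recommendations` and `ineffective_sessions` as well as the toy sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_k_recommendations(sessions, k=1):
    """Map each product to its k most frequently co-occurring products."""
    counts = {}
    for session in sessions:
        # Count every unordered product pair once per session.
        for a, b in combinations(sorted(set(session)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    recs = {}
    for product, co in counts.items():
        # Sort by descending count, then by name, so ties break deterministically.
        ranked = sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))
        recs[product] = [p for p, _ in ranked[:k]]
    return recs


def ineffective_sessions(sessions, k=1):
    """Indices of sessions whose removal leaves all recommendations unchanged."""
    baseline = top_k_recommendations(sessions, k)
    unchanged = []
    for i in range(len(sessions)):
        reduced = sessions[:i] + sessions[i + 1:]
        if top_k_recommendations(reduced, k) == baseline:
            unchanged.append(i)
    return unchanged


# Toy data (assumed for illustration only).
sessions = [
    ["shoes", "socks"],
    ["shoes", "socks"],
    ["shoes", "socks"],
    ["shoes", "laces"],
    ["hat"],
]
no_influence = ineffective_sessions(sessions, k=1)
share = len(no_influence) / len(sessions)
print(no_influence, share)
```

In this toy data, only the session containing `laces` changes the output when removed, because it is the sole evidence for that product. Retraining once per session is quadratic in the number of sessions, so a production-scale analysis would need an incremental update of the co-occurrence counts instead.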
### Calculation of data value:
- **Assumptions**:
1. Data value must be calculated at the data-point level.
2. Data value depends only on its contribution to decision-making quality.
3. Data-driven decisions can be automated.
- **Prerequisites**:
1. System stability.
2. Quantitative output evaluation.
3. Quantitative performance indicators.
### Calculation of data value in practice:
- Estimate the value of data points by creating different datasets, training a model on each, and comparing how these models perform in automated decision-making.
- Use key performance indicators (KPIs) such as conversion rate, click-through rate, and annual revenue to quantify the contribution of data points.
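The procedure above (vary the dataset, retrain, compare KPIs) can be sketched as a leave-one-out sensitivity analysis. This is a hedged illustration under stated assumptions rather than the paper's exact method: the value of a data point is taken to be the KPI of the model trained on all data minus the KPI of the model trained without that point. The helper `data_point_values`, the mean-predictor "model", and the toy KPI are all hypothetical stand-ins.

```python
def data_point_values(data, train, kpi):
    """Value of each point: KPI with the full data minus KPI without that point."""
    baseline = kpi(train(data))
    values = []
    for i in range(len(data)):
        reduced = data[:i] + data[i + 1:]
        values.append(baseline - kpi(train(reduced)))
    return values


def train(data):
    # Assumed toy model: predict the mean of the training data.
    return sum(data) / len(data)


def kpi(model):
    # Assumed toy KPI: negative absolute error against a held-out target of 4.0,
    # so a higher KPI means better decisions.
    return -abs(model - 4.0)


data = [4.0, 4.0, 4.0, 10.0]
values = data_point_values(data, train, kpi)

# A data-minimal dataset keeps only points with a positive contribution.
kept = [x for x, v in zip(data, values) if v > 0]
print(values, kept)
```

A negative value marks a point whose removal improves the KPI, which corresponds to the harmful data mentioned under the benefits of data minimalism. If the KPI is expressed in monetary terms such as revenue, the same difference gives a price per data point.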
In conclusion, the paper combines theory and experiments to explore how data minimalism can be achieved in large-scale machine learning systems, maintaining or improving system performance while reducing data usage.