Data Acquisition: A New Frontier in Data-centric AI

Lingjiao Chen, Bilge Acun, Newsha Ardalani, Yifan Sun, Feiyang Kang, Hanrui Lyu, Yongchan Kwon, Ruoxi Jia, Carole-Jean Wu, Matei Zaharia, James Zou
2023-11-23
Abstract: As Machine Learning (ML) systems continue to grow, the demand for relevant and comprehensive datasets becomes imperative. There is limited study on the challenges of data acquisition due to ad-hoc processes and a lack of consistent methodologies. We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets, transparent pricing, and standardized data formats. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. The benchmark was released as a part of DataPerf. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper primarily focuses on the challenges of data acquisition in machine learning (ML) systems. As ML systems continue to grow, the need for relevant and comprehensive datasets becomes crucial. However, there are several key issues in the current data market:

1. **Information Opacity**: Most data providers are reluctant to provide detailed information about the complete dataset to data buyers, making it difficult for buyers to design effective data acquisition strategies.
2. **Pricing Opacity**: The pricing mechanisms in data markets are usually not public, requiring private negotiations to determine the price, which increases the complexity and cost of transactions.
3. **Non-uniform Data Formats**: Different data providers offer data in various formats, requiring buyers to do extra work to convert these data into a uniform format.
4. **Lack of Effective Data Acquisition Strategies**: Existing data markets lack systematic methods to help buyers understand the quality and relevance of potential data before transactions.

To address these challenges, the authors propose a benchmark called DAM (Data Acquisition for ML), which aims to simulate the interactions between data providers and buyers in the data market and encourage the community to develop more effective data acquisition strategies. The goals of DAM are:

- **Budget Awareness**: Enable data buyers to easily understand the cost of the data they purchase and make informed decisions based on their budget.
- **Price Transparency**: Allow data providers to publicly disclose their pricing models, enabling buyers to compare prices across different providers.
- **Support for Multiple Data Sources**: Provide diverse data sources, allowing buyers to choose the data that best fits their needs.
- **Useful Information Sharing**: Enable data buyers and providers to share information and insights to improve the quality and relevance of the data being sold.

Through DAM, the authors hope to lay the foundation for data acquisition in the data-centric AI field and inspire a broader range of researchers to address the key challenges in this area.
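
To make the budget-awareness and price-transparency goals concrete, here is a minimal sketch of one possible buyer-side strategy: rank providers by estimated usefulness per unit cost and buy greedily until the budget runs out. It is an illustration only, not the DAM benchmark's API and not a strategy described in the paper. The `Provider` fields, the linear per-sample pricing, and the `est_gain_per_sample` estimate are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical seller description (not a DAM data structure)."""
    name: str
    price_per_sample: float     # assumed public, linear, and strictly positive
    est_gain_per_sample: float  # buyer's own estimate of usefulness per sample
    max_samples: int            # how many samples this seller can supply


def greedy_acquire(providers, budget):
    """Return (plan, remaining_budget), where plan maps provider name to the
    number of samples to buy. Total spend never exceeds the budget.

    Providers are visited in decreasing order of estimated gain per unit cost.
    """
    ranked = sorted(
        providers,
        key=lambda p: p.est_gain_per_sample / p.price_per_sample,
        reverse=True,
    )
    plan = {}
    remaining = budget
    for p in ranked:
        # Largest whole number of samples still affordable from this seller.
        affordable = int(remaining // p.price_per_sample)
        n = min(affordable, p.max_samples)
        if n > 0:
            plan[p.name] = n
            remaining -= n * p.price_per_sample
    return plan, remaining
```

A strategy like this is only possible when prices are public and the buyer has some per-provider signal of data usefulness, which is exactly the information the summary above says current marketplaces tend to withhold. Greedy ranking is a simple heuristic under those assumptions and is not guaranteed to be optimal.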