DataLab: A Platform for Data Analysis and Intervention

Yang Xiao, Jinlan Fu, Weizhe Yuan, Vijay Viswanathan, Zhoumianze Liu, Yixin Liu, Graham Neubig, Pengfei Liu
DOI: https://doi.org/10.48550/arXiv.2202.12875
2022-02-26
Abstract: Despite data's crucial role in machine learning, most existing tools and research tend to focus on systems on top of existing data rather than how to interpret and manipulate data. In this paper, we propose DataLab, a unified data-oriented platform that not only allows users to interactively analyze the characteristics of data, but also provides a standardized interface for different data processing operations. Additionally, in view of the ongoing proliferation of datasets, DataLab has features for dataset recommendation and global vision analysis that help researchers form a better view of the data ecosystem. So far, DataLab covers 1,715 datasets and 3,583 of their transformed versions (e.g., hyponym replacement), where 728 datasets support various analyses (e.g., with respect to gender bias) with the help of 140M samples annotated by 318 feature functions. DataLab is under active development and will be supported going forward. We have released a web platform, web API, Python SDK, PyPI published package and online documentation, which hopefully can meet the diverse needs of researchers.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the following problem: although data plays a crucial role in machine learning, most existing tools and research focus on systems built on existing data and neglect how to interpret and manipulate the data itself. In response, the authors propose DataLab, a unified data-oriented platform that lets users interactively analyze data characteristics and provides a standardized interface for data processing operations. Specifically, the paper focuses on the following aspects:

1. **Data Analysis and Diagnosis**: Most existing research focuses on explaining the output of machine learning systems rather than on understanding the data itself. DataLab provides data diagnosis functions that help users discover undesirable characteristics in data, such as hate speech, gender bias, or label imbalance.
2. **Operation Standardization**: Different data processing packages expose different interfaces, so users must install multiple toolkits to meet diverse needs, which reduces development efficiency and hurts reproducibility. DataLab alleviates these problems by providing a unified, standardized interface for data processing operations.
3. **Data Search**: With the rapid growth in the number of datasets, choosing a dataset suited to a specific application scenario has become difficult. DataLab provides a semantic dataset search tool to help researchers find suitable datasets.
4. **Global Analysis**: Beyond the analysis of a single dataset, analyzing the existing dataset ecosystem as a whole can reveal deeper problems. DataLab provides tools for global analysis across multiple datasets to identify systemic inequalities.

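
To make the diagnosis and operation standardization ideas more concrete, the following is a minimal Python sketch. It does not use the DataLab SDK, and every name in it (`OPERATIONS`, `register`, `apply_operation`, `label_imbalance`, the toy `dataset`) is a hypothetical illustration rather than DataLab's actual API. It only shows the general pattern of exposing different per-sample operations behind one calling convention and of computing a simple label imbalance statistic.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical registry: every operation takes one sample (a dict) and
# returns a dict of new or updated fields. This is NOT the DataLab API.
OPERATIONS: Dict[str, Callable[[dict], dict]] = {}


def register(name: str):
    """Register a per-sample operation under a common name."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        OPERATIONS[name] = fn
        return fn
    return wrap


@register("text_length")
def text_length(sample: dict) -> dict:
    # Feature function: annotate the sample with its whitespace token count.
    return {"text_length": len(sample["text"].split())}


@register("lowercase")
def lowercase(sample: dict) -> dict:
    # Transformation: replace the text with a lowercased copy.
    return {"text": sample["text"].lower()}


def apply_operation(dataset: List[dict], name: str) -> List[dict]:
    """Apply a registered operation to every sample through one interface."""
    op = OPERATIONS[name]
    return [{**sample, **op(sample)} for sample in dataset]


def label_imbalance(dataset: List[dict]) -> float:
    """Ratio of the most frequent label count to the least frequent one."""
    counts = Counter(sample["label"] for sample in dataset)
    return max(counts.values()) / min(counts.values())


# Toy data invented for this sketch.
dataset = [
    {"text": "A great movie", "label": "pos"},
    {"text": "Loved it", "label": "pos"},
    {"text": "Terrible plot", "label": "neg"},
]

annotated = apply_operation(dataset, "text_length")
ratio = label_imbalance(dataset)
```

Because both the feature function and the transformation are called through `apply_operation`, adding an operation only requires registering another function. This is one plausible way to read "standardized interface", not a description of how DataLab implements it.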
In summary, DataLab aims to address current deficiencies in data processing and analysis in natural language processing (NLP) by providing a comprehensive platform, thereby promoting more efficient and more transparent research.