DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI

Hyowon Cho, Soonwon Ka, Daechul Park, Jaewook Kang, Minjoon Seo, Bokyung Son
2024-12-09
Abstract: Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the tendency of large language models (LLMs) to rely on pre-trained knowledge rather than actual data patterns when analyzing large-scale datasets. Specifically, it identifies the following problems that arise when LLMs are used to extract latent features:

1. **Data Grounding Problem**: LLMs tend to rely on pre-trained knowledge rather than the specific features of the input data, so the generated features may not reflect the characteristics of the data itself.
2. **Verification Difficulty**: Without quantitative evaluation methods, it is difficult to verify whether LLM-generated responses are accurate. Verification therefore requires expert supervision, which increases costs.
3. **Subjective Bias**: Human data analysis easily introduces subjective bias, and collaborating with domain experts is costly.

To solve these problems, the authors propose a framework named Data Scientist AI (DSAI), which aims to achieve unbiased and interpretable latent feature extraction through a multi-stage pipeline. The main objectives of DSAI are:

- **Reduce Bias**: Ensure that LLMs rely on the data itself, rather than their pre-trained knowledge, when extracting latent features.
- **Introduce Quantitative Indicators**: Introduce quantifiable prominence metrics that measure the significance of each feature, so that its discriminative ability can be evaluated.
- **Improve Interpretability**: Make the feature extraction process more transparent and interpretable by keeping each feature traceable to the source data.
- **Automate the Processing of Large-scale Datasets**: Systematically guide LLMs through data analysis, reducing manual labor and ensuring that the output is based on data rather than domain-specific assumptions.

Overall, the paper aims to make LLM-based feature extraction on large-scale datasets more objective, reliable, and interpretable through the DSAI framework.
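
To make the ideas of a quantitative indicator and traceability more concrete, here is a minimal Python sketch. It is not the paper's method: this summary does not define DSAI's prominence metric, so the sketch uses a simple stand-in (the difference in how often a feature appears in two groups of samples). The function names `feature_rates`, `rate_gap`, and `supporting_samples`, the example feature labels, and the assumption that each sample has already been annotated with a set of feature labels are all hypothetical.

```python
from collections import Counter


def feature_rates(annotations):
    """Return the fraction of samples in which each feature label appears.

    `annotations` is a list of sets, one set of feature labels per sample
    (a hypothetical format, not taken from the paper).
    """
    n = len(annotations)
    if n == 0:
        return {}
    counts = Counter(label for sample in annotations for label in sample)
    return {label: count / n for label, count in counts.items()}


def rate_gap(group_a, group_b):
    """Stand-in discriminative score: rate in group A minus rate in group B.

    Positive values mean the feature is more common in group A. This is an
    illustrative assumption, not DSAI's actual prominence metric.
    """
    rates_a = feature_rates(group_a)
    rates_b = feature_rates(group_b)
    labels = set(rates_a) | set(rates_b)
    return {label: rates_a.get(label, 0.0) - rates_b.get(label, 0.0)
            for label in labels}


def supporting_samples(annotations, label):
    """Traceability: indices of the samples that carry a given feature."""
    return [i for i, sample in enumerate(annotations) if label in sample]


# Toy data: feature labels per sample in a "high" and a "low" group.
high = [{"concise", "question"}, {"concise"}, {"concise", "emoji"}, {"question"}]
low = [{"emoji"}, {"concise", "emoji"}, {"emoji"}, {"question"}]

gaps = rate_gap(high, low)
evidence = supporting_samples(high, "concise")
```

Because every score is computed from counts over the annotated samples, and `supporting_samples` points back to the exact rows behind a feature, a reader can check each number against the data instead of trusting the model's prior knowledge.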