Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data

Haoyang Liu, Yijiang Li, Jinglin Jian, Yuxuan Cheng, Jianrong Lu, Shuyi Guo, Jinglei Zhu, Mianchen Zhang, Miantong Zhang, Haohan Wang
2024-02-21
Abstract: Machine learning has emerged as a powerful tool for scientific discovery, enabling researchers to extract meaningful insights from complex datasets. For instance, it has facilitated the identification of disease-predictive genes from gene expression data, significantly advancing healthcare. However, the traditional process for analyzing such datasets demands substantial human effort and expertise for the data selection, processing, and analysis. To address this challenge, we introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM). These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes. Furthermore, we have curated a benchmark dataset to assess TAIS's effectiveness in gene identification, demonstrating our system's potential to significantly enhance the efficiency and scope of scientific exploration. Our findings represent a solid step towards automating scientific discovery through large language models.
Genomics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the substantial human effort and expertise that traditional gene expression analysis requires for data selection, processing, and analysis. The authors propose a framework, a Team of AI-made Scientists (TAIS), that aims to streamline the scientific discovery pipeline by simulating research roles such as a project manager, a data engineer, and a domain expert, each represented by a large language model (LLM). These roles collaborate to perform tasks typically handled by data scientists, with a particular focus on identifying disease-predictive genes. The authors also curate a benchmark dataset to evaluate TAIS's effectiveness in gene identification. Overall, the paper's goal is to move toward automating scientific discovery with LLMs, reducing the human effort and technical expertise required.
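To make the role-based pipeline concrete, here is a minimal Python sketch under strong assumptions. It is not the authors' implementation: the roles are plain functions standing in for LLM-backed agents, the step names, the `rank_genes` helper, and the toy data are all hypothetical, and ranking genes by absolute Pearson correlation with the disease label is a simple stand-in for whatever analysis TAIS actually performs.

```python
from statistics import mean, pstdev


def project_manager(task):
    """Hypothetical planner role: turn a task into ordered step names."""
    return ["preprocess", "rank_genes"]


def data_engineer(expression):
    """Hypothetical preprocessing role: drop genes whose expression never varies."""
    return {gene: values for gene, values in expression.items() if pstdev(values) > 0}


def rank_genes(expression, labels, top_k):
    """Hypothetical analysis step: rank genes by |Pearson correlation| with the label."""

    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
        return cov / (pstdev(xs) * pstdev(ys))

    scores = {gene: abs(pearson(values, labels)) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def run_team(expression, labels, top_k=1):
    """Execute the planned steps in order, passing data from one role to the next."""
    data, result = expression, []
    for step in project_manager("find disease-predictive genes"):
        if step == "preprocess":
            data = data_engineer(data)
        elif step == "rank_genes":
            result = rank_genes(data, labels, top_k)
    return result


# Toy data (invented for illustration): gene -> expression per sample.
toy_expression = {
    "GENE_A": [1.0, 1.2, 3.1, 3.3],  # tracks the label closely
    "GENE_B": [2.0, 2.0, 2.0, 2.0],  # constant, removed during preprocessing
    "GENE_C": [0.5, 2.5, 0.7, 2.4],  # varies, but not with the label
}
toy_labels = [0, 0, 1, 1]  # 0 = control, 1 = disease

top_genes = run_team(toy_expression, toy_labels, top_k=1)
print(top_genes)
```

In a real system each role would be backed by an LLM call and the plan would depend on the task; the sketch only shows how a planner's output can drive a sequence of specialized steps.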