ChIP-GPT: a managed large language model for robust data extraction from biomedical database records

Olivier Cinquin
DOI: https://doi.org/10.1093/bib/bbad535
IF: 9.5
2024-02-06
Briefings in Bioinformatics
Abstract: Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90–94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is how to robustly extract data from biomedical database records. Specifically, the authors focus on how to use large language models (LLMs) to process and analyze large-scale biomedical data records from the Sequence Read Archive (SRA), especially to identify chromatin immunoprecipitation (ChIP) targets and cell lines.

### Detailed description of the main problem

1. **Limitations of existing tools**:
   - Existing tools mainly focus on extracting predefined fields, and often fail to comprehensively process database entries or correct obvious errors.
   - These tools lack the ability to reason like domain experts, which limits their robustness and depth of analysis.
2. **Challenges in large-scale applications**: applying LLMs to large-scale databases raises the following challenges:
   - Interactions with the LLM must be automated.
   - Input length limitations may require pre-processing of records (such as pruning or summarization).
   - For the LLM to behave reliably, it needs either well-designed, short "few-shot" examples or fine-tuning on a larger set of well-curated examples.
3. **Complexity of specific tasks**:
   - Metadata structures in the SRA vary greatly, and many records contain misspellings or missing field labels.
   - Many experiments do not rely only on antibodies against the target protein; they may instead express chimeras with fusion tags, which makes ChIP targets harder to identify.

### Overview of the solution

To solve the above problems, the authors developed a system named ChIP-GPT, which is based on a fine-tuned version of the generative pre-trained transformer (GPT) model Llama and which processes SRA records automatically by prompting the model iteratively. ChIP-GPT is able to:

- Automatically identify experimental treatments and extract ChIP target and cell line information.
- Process records with misspellings or missing field labels.
- Adapt to different databases and customized questions.

### Performance evaluation

Evaluated on 50 records that were not used in training, ChIP-GPT achieved an accuracy of 90–94% on the ChIP target and cell line identification tasks, which is comparable to the accuracy of manual review.

In conclusion, this paper aims to provide an efficient, accurate, and scalable method for processing and analyzing large-scale biomedical data records in the SRA by introducing ChIP-GPT.
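
The workflow summarized above (prune a record to fit an input limit, then prompt a model once per question and keep only the answer text) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `prune_record`, `build_prompt`, and `extract_metadata`, the prompt wording, the character budget, and the sample record are all hypothetical, and `toy_model` is a hard-coded stand-in for a fine-tuned LLM rather than a representation of how ChIP-GPT works internally.

```python
from typing import Callable, Dict, List


def prune_record(record: str, max_chars: int = 500) -> str:
    """Drop blank and duplicate lines, then truncate to a character budget.

    Hypothetical pruning rule; the paper's actual pre-processing may differ.
    """
    seen = set()
    kept: List[str] = []
    for line in record.splitlines():
        line = line.strip()
        if line and line.lower() not in seen:
            seen.add(line.lower())
            kept.append(line)
    return "\n".join(kept)[:max_chars]


def build_prompt(record: str, question: str) -> str:
    """Combine the pruned record and one question into a single prompt."""
    return f"Record:\n{record}\n\nQuestion: {question}\nAnswer:"


def extract_metadata(
    record: str,
    questions: Dict[str, str],
    model: Callable[[str], str],
    max_chars: int = 500,
) -> Dict[str, str]:
    """Prompt the model once per question and keep the first non-empty line."""
    pruned = prune_record(record, max_chars)
    answers: Dict[str, str] = {}
    for field, question in questions.items():
        raw = model(build_prompt(pruned, question))
        lines = [part.strip() for part in raw.splitlines() if part.strip()]
        answers[field] = lines[0] if lines else ""
    return answers


def toy_model(prompt: str) -> str:
    """Hard-coded stand-in for a fine-tuned LLM (assumption, not ChIP-GPT)."""
    question = prompt.rsplit("Question:", 1)[1]
    if "cell line" in question:
        return "HeLa\n(extra generated text)"
    return "CTCF\n(extra generated text)"


# Invented record with a duplicated line and a misspelled field label.
RECORD = """
title: CTCF ChIP-seq in HeLa
title: CTCF ChIP-seq in HeLa

antibody: anti-CTCF
cell lnie: HeLa
"""

QUESTIONS = {
    "chip_target": "What protein was targeted by the ChIP?",
    "cell_line": "What cell line was used?",
}

result = extract_metadata(RECORD, QUESTIONS, toy_model)
print(result)
```

In a real pipeline, `toy_model` would be replaced by a call to the fine-tuned model, and the pruning step would need to respect the model's token limit rather than a character count.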