Enhancing Gene Set Overrepresentation Analysis with Large Language Models

Jiqing Zhu, Rebecca Y. Wang, Xiaoting Wang, Ricardo Azevedo, Alexander Moreno, Julia A. Kuhn, Zia Khan
DOI: https://doi.org/10.1101/2024.11.11.621189
2024-11-14
Abstract: Gene set overrepresentation analysis is a widely used approach for interpreting high-throughput transcriptomics and proteomics data. However, traditional methods rely on static, human-curated gene set databases, which limits their flexibility. We introduce a framework that leverages large language models (LLMs) to dynamically generate gene sets from natural language descriptions. Through benchmarking against curated gene set databases, we show that LLM-generated gene sets are significantly overrepresented in the corresponding human-curated gene sets. Additionally, LLMs can propose multiple biological processes for an input set of differentially expressed genes (DEGs), enabling the identification of overlapping pathways. Applying the framework to RNA-seq data from iPSC-derived microglia treated with an agonist reveals more interpretable and relevant pathways than static databases do, demonstrating the potential of LLMs for flexible, context-aware gene set generation. This approach enhances hypothesis generation and improves the interpretation of high-throughput biological data. The framework is available as open source.
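For readers unfamiliar with the statistic behind overrepresentation analysis, the following is a minimal sketch of the standard one-sided hypergeometric test, which asks whether a list of DEGs overlaps a gene set more than chance would predict. It is not taken from the paper's code: the function name, the toy gene identifiers, and the choice of a hypergeometric test are illustrative assumptions, and the gene set could equally be human-curated or LLM-generated.

```python
from math import comb


def overrepresentation_pvalue(degs, gene_set, background):
    """Return (overlap, p) where p = P(overlap >= observed) under a
    hypergeometric null. Hypothetical helper, not the paper's API."""
    background = set(background)
    # Restrict both lists to the background universe of measured genes.
    degs = set(degs) & background
    gene_set = set(gene_set) & background

    N = len(background)       # genes in the universe
    K = len(gene_set)         # genes annotated to the set
    n = len(degs)             # differentially expressed genes
    k = len(degs & gene_set)  # observed overlap

    # Sum the upper tail of the hypergeometric distribution exactly.
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


# Toy data (made up for illustration): 20 background genes,
# a 5-gene set, and 4 DEGs of which 3 fall in the set.
background = [f"G{i}" for i in range(20)]
gene_set = ["G0", "G1", "G2", "G3", "G4"]
degs = ["G0", "G1", "G2", "G10"]

k, p = overrepresentation_pvalue(degs, gene_set, background)
print(k, round(p, 4))  # 3 0.032
```

In practice the p-values from many gene sets would also be corrected for multiple testing; that step is omitted here for brevity.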
Bioinformatics