Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

Peng Ye, Weiqiang Bai, Yuchen Ren, Wenran Li, Lifeng Qiao, Chaoqi Liang, Linxiao Wang, Yuchen Cai, Jianle Sun, Zejun Yang, Peng Zheng, Nanqing Dong, Tao Chen, Zhihui Wang, Xihui Liu, Xinzhu Ma, Hongliang Yan, Zhen Wang, Sijia Wang, Wanli Ouyang
DOI: https://doi.org/10.1101/2024.07.16.603653
2024-07-19
Abstract: Artificial intelligence (AI) plays a crucial role in genomic analysis, offering great potential for comprehending biological phenomena such as heredity, development, diseases, and evolution. However, the development of AI models needs substantial labeled data, and these models are typically task-specific with limited generalizability to various applications. Here, we develop Genomics-FM, a genomic vocabulary driven foundation model that enables versatile and label-efficient functional genomic analysis. Specifically, Genomics-FM is first pretrained with ensemble genomic vocabulary on vast unlabelled data to learn comprehensive and generalizable representations and then finetuned with specific genomic vocabulary on limited labeled data to selectively activate and adapt the pretraining knowledge for specific tasks. We show that Genomics-FM significantly reduces the dependence on labeled data, and demonstrates the capability to outperform existing models across a comprehensive suite of tasks including genome annotation, epigenomic and expression profile prediction, and variant effect assessment. Remarkably, Genomics-FM even shows impressive zero-shot predictive capabilities across diverse species and tissues and exhibits noticeable adaptability to RNA-related tasks. With feasibility in data scarcity and even cross-domain biological scenarios, Genomics-FM will promote the broad application of AI and empower researchers to tackle previously insurmountable challenges, paving the way for groundbreaking research and discoveries.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the low data efficiency and task specificity of current artificial intelligence (AI) models in functional genomic analysis. Existing models usually require large amounts of labeled data, are optimized for a single task, and generalize poorly to other applications. This leads to two main problems:

1. **High data dependence**: Most existing AI models rely on large amounts of high-quality labeled data, which require careful expert evaluation and considerable labor to produce. In genomics in particular, the complexity and diversity of the data mean that many important genomic datasets remain unlabeled or under-utilized.
2. **Task-specificity limitation**: Existing models are usually trained for one task and are difficult to transfer between different genomic tasks, which limits their scope of application and their efficiency.

To address these problems, the paper proposes a new foundation model, Genomics-FM. It is pretrained on unlabeled data with an ensemble genomic vocabulary and then fine-tuned on limited labeled data with a task-specific genomic vocabulary, which enables versatile and label-efficient functional genomic analysis. According to the authors, this approach reduces the dependence on labeled data and outperforms existing models on multiple genomic tasks, including genome annotation, epigenomic and expression profile prediction, and variant effect assessment. Genomics-FM also shows zero-shot predictive ability across species and tissues and adapts to RNA-related tasks, which supports its use in data-scarce and cross-domain biological scenarios.
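The distinction between an ensemble vocabulary for pretraining and a single vocabulary for fine-tuning can be illustrated with a toy tokenizer. This is only a sketch under stated assumptions: the summary does not say which vocabularies Genomics-FM uses, so the three schemes below (single nucleotides, overlapping 3-mers, non-overlapping 3-mers) and all function names are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List


def single_nucleotide_tokens(seq: str) -> List[str]:
    """Split a DNA sequence into one token per base."""
    return list(seq.upper())


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mers; stride=1 overlaps, stride=k does not."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical set of vocabularies (assumption: not taken from the paper).
VOCABULARIES: Dict[str, Callable[[str], List[str]]] = {
    "nucleotide": single_nucleotide_tokens,
    "kmer3_overlap": lambda s: kmer_tokens(s, k=3, stride=1),
    "kmer3_nonoverlap": lambda s: kmer_tokens(s, k=3, stride=3),
}


def ensemble_tokenize(seq: str) -> Dict[str, List[str]]:
    """Pretraining-style view: tokenize the same sequence with every vocabulary."""
    return {name: tokenize(seq) for name, tokenize in VOCABULARIES.items()}


def task_tokenize(seq: str, vocabulary: str) -> List[str]:
    """Fine-tuning-style view: tokenize with the one vocabulary chosen for a task."""
    return VOCABULARIES[vocabulary](seq)


if __name__ == "__main__":
    example = "ACGTAC"
    for name, tokens in ensemble_tokenize(example).items():
        print(name, tokens)
    print(task_tokenize(example, "kmer3_nonoverlap"))
```

In this sketch, pretraining would see every tokenized view of a sequence, while a downstream task would pick whichever single view suits it; how the actual model combines or selects vocabularies is described in the paper itself.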