Genomics-FM: Universal Foundation Model for Versatile and Data-Efficient Functional Genomic Analysis

Peng Ye, Weiqiang Bai, Yuchen Ren, Wenran Li, Lifeng Qiao, Chaoqi Liang, Linxiao Wang, Yuchen Cai, Jianle Sun, Zejun Yang, Peng Zheng, Nanqing Dong, Tao Chen, Zhihui Wang, Xihui Liu, Xinzhu Ma, Hongliang Yan, Zhen Wang, Sijia Wang, Wanli Ouyang

DOI: https://doi.org/10.1101/2024.07.16.603653

2024-07-19

Abstract: Artificial intelligence (AI) plays a crucial role in genomic analysis, offering great potential for comprehending biological phenomena such as heredity, development, diseases, and evolution. However, the development of AI models needs substantial labeled data, and these models are typically task-specific with limited generalizability to various applications. Here, we develop Genomics-FM, a genomic vocabulary driven foundation model that enables versatile and label-efficient functional genomic analysis. Specifically, Genomics-FM is first pretrained with ensemble genomic vocabulary on vast unlabelled data to learn comprehensive and generalizable representations and then finetuned with specific genomic vocabulary on limited labeled data to selectively activate and adapt the pretraining knowledge for specific tasks. We show that Genomics-FM significantly reduces the dependence on labeled data, and demonstrates the capability to outperform existing models across a comprehensive suite of tasks including genome annotation, epigenomic and expression profile prediction, and variant effect assessment. Remarkably, Genomics-FM even shows impressive zero-shot predictive capabilities across diverse species and tissues and exhibits noticeable adaptability to RNA-related tasks. With feasibility in data scarcity and even cross-domain biological scenarios, Genomics-FM will promote the broad application of AI and empower researchers to tackle previously insurmountable challenges, paving the way for groundbreaking research and discoveries.

Bioinformatics

What problem does this paper attempt to address?

This paper addresses the low data efficiency and task specificity of current artificial intelligence (AI) models in functional genomic analysis. Existing AI models usually require large amounts of labeled data, are often optimized for a single task, and generalize poorly to other applications. This leads to two main problems:

1. **High data dependence**: Most existing AI models rely on large amounts of high-quality labeled data, which require careful expert evaluation and substantial labor to acquire. In genomics in particular, the complexity and diversity of the data mean that many important genomic datasets remain unlabeled or under-utilized.
2. **Task-specificity limitation**: Existing models are usually trained for specific tasks and are difficult to transfer between different genomic tasks, which limits their scope of application and their efficiency.

To address these problems, the paper proposes a new foundation model, Genomics-FM. It is pre-trained with an ensemble genomic vocabulary on unlabeled data and then fine-tuned with a task-specific genomic vocabulary on limited labeled data, enabling versatile and label-efficient functional genomic analysis. According to the paper, this approach reduces the dependence on labeled data and outperforms existing models across multiple genomic tasks, including genome annotation, epigenomic and expression profile prediction, and variant effect assessment. Genomics-FM also shows zero-shot predictive ability across species and tissues, which supports its adaptability in data-scarce and cross-domain biological scenarios.
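To make the distinction between an "ensemble" and a "specific" genomic vocabulary more concrete, here is a minimal Python sketch. It assumes, purely for illustration, that a vocabulary is a rule for splitting a DNA sequence into tokens. The three rules shown (single nucleotides, overlapping k-mers, non-overlapping k-mers) and all function names are hypothetical and are not taken from the paper, which does not define its vocabularies in the text above.

```python
from typing import Callable, Dict, List


def single_nucleotide(seq: str) -> List[str]:
    """Hypothetical vocabulary: one token per base."""
    return list(seq)


def overlapping_kmers(seq: str, k: int = 3) -> List[str]:
    """Hypothetical vocabulary: sliding window of length k, stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def non_overlapping_kmers(seq: str, k: int = 3) -> List[str]:
    """Hypothetical vocabulary: consecutive chunks of length k (trailing bases dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


# Illustrative registry of tokenization rules; the names are assumptions.
VOCABULARIES: Dict[str, Callable[[str], List[str]]] = {
    "single": single_nucleotide,
    "overlap_3mer": overlapping_kmers,
    "chunk_3mer": non_overlapping_kmers,
}


def tokenize_ensemble(seq: str) -> Dict[str, List[str]]:
    """'Ensemble' view: tokenize the same sequence under every vocabulary."""
    return {name: tokenizer(seq) for name, tokenizer in VOCABULARIES.items()}


def tokenize_specific(seq: str, name: str) -> List[str]:
    """'Specific' view: tokenize with the single vocabulary chosen for a task."""
    return VOCABULARIES[name](seq)
```

In this toy setting, `tokenize_ensemble("ACGTAC")` returns one token list per vocabulary, loosely mirroring pre-training on several views of the same unlabeled sequence, while `tokenize_specific("ACGTAC", "overlap_3mer")` returns a single token list, loosely mirroring the choice of one vocabulary during fine-tuning. How Genomics-FM actually builds and combines its vocabularies is described in the paper itself.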
