Abstract: Bioinformatics has undergone a paradigm shift with the increasing integration of artificial intelligence (AI), particularly through the adoption of foundation models (FMs). These techniques have advanced rapidly and address long-standing challenges in bioinformatics, such as the scarcity of annotated data and the presence of data noise. FMs are particularly adept at handling large-scale, unlabeled data, which is common in biological contexts because determining labels experimentally is time-consuming and costly. This characteristic has allowed FMs to achieve notable results in various downstream validation tasks, demonstrating their ability to represent diverse biological entities effectively. FMs have ushered in a new era in computational biology, especially in the realm of deep learning. The primary goal of this survey is to provide a systematic investigation and summary of FMs in bioinformatics, tracing their evolution, current research status, and methodologies. We focus on the application of FMs to specific biological problems, aiming to guide the research community in choosing appropriate FMs for their research needs. We examine these problems in detail, including sequence analysis, structure prediction, function annotation, and multimodal integration, and compare the model structures and advancements against traditional methods. Furthermore, the review analyses the challenges and limitations faced by FMs in biology, such as data noise, model explainability, and potential biases. Finally, we outline potential development paths and strategies for FMs in future biological research, setting the stage for continued innovation and application in this rapidly evolving field. This review serves not only as an academic resource but also as a roadmap for future explorations and applications of FMs in biology.

BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics

scReader: Prompting Large Language Models to Interpret scRNA-seq Data

The Development of AI Foundation Models for Single-Cell Transcriptomics

Progress and Opportunities of Foundation Models in Bioinformatics

Large-scale foundation model on single-cell transcriptomics

scInterpreter: Training Large Language Models to Interpret scRNA-seq Data for Cell Type Annotation

scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

CELLama: Foundation Model for Single Cell and Spatial Transcriptomics by Cell Embedding Leveraging Language Model Abilities

scMODAL: A general deep learning framework for comprehensive single-cell multi-omics data alignment with feature links

scGPT: toward building a foundation model for single-cell multi-omics using generative AI

scLM: Automatic Detection of Consensus Gene Clusters Across Multiple Single-Cell Datasets

Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data

GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets

f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq

FIRM: Flexible Integration of single-cell RNA-sequencing data for large-scale Multi-tissue cell atlas datasets