Abstract: Accurate predictive modeling of human gene relationships would fundamentally transform our ability to uncover the molecular mechanisms that underpin key biological and disease processes. Recent studies have employed advanced AI techniques to model the complexities of gene networks using large gene expression datasets. However, the extent and nature of the biological information these models can learn is not fully understood. Furthermore, the potential for improving model performance by using alternative data types, model architectures, and methodologies remains underexplored. Here, we developed GeneRAIN models by training on a large dataset of 410K human bulk RNA-seq samples, rather than single-cell RNA-seq datasets used by most previous studies. We showed that although the models were trained only on gene expression data, they learned a wide range of biological information well beyond gene expression. We introduced GeneRAIN-vec, a state-of-the-art, multifaceted vectorized representation of genes. Further, we demonstrated the capabilities and broad applicability of this approach by making 4,797 biological attribute predictions for each of 13,030 long non-coding RNAs (62.5 million predictions in total). These achievements stem from various methodological innovations, including experimenting with multiple model architectures and a new 'Binning-By-Gene' normalization method. Comprehensive evaluation of our models clearly demonstrated that they significantly outperformed current state-of-the-art models. This study improves our understanding of the capabilities of Transformer and self-supervised deep learning when applied to extensive expression data. Our methodological advancements offer crucial insights into refining these techniques. These innovations are set to significantly advance our understanding and exploration of biology.
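
The abstract names a 'Binning-By-Gene' normalization method but does not describe how it works. One plausible reading of the name, and it is only an assumption here, is that each gene's expression values are ranked across all samples and split into equal-sized bins, so every gene ends up evenly spread over the bins regardless of its absolute expression level. The function name `bin_by_gene`, the toy matrix, and the bin count below are hypothetical and are not taken from the paper.

```python
def bin_by_gene(expression, n_bins):
    """Assign each value to a bin based on its rank within its own gene.

    ASSUMPTION: this is an illustrative per-gene quantile binning, not the
    authors' confirmed procedure.

    expression: list of samples, each a list with one value per gene.
    Returns a matrix of the same shape holding bin indices 0..n_bins-1.
    """
    n_samples = len(expression)
    n_genes = len(expression[0])
    binned = [[0] * n_genes for _ in range(n_samples)]
    for gene in range(n_genes):
        # Rank samples by this gene's expression; sorted() is stable,
        # so ties keep their original sample order.
        order = sorted(range(n_samples), key=lambda s: expression[s][gene])
        for rank, sample in enumerate(order):
            # Map rank 0..n_samples-1 onto bins 0..n_bins-1 in equal shares.
            binned[sample][gene] = rank * n_bins // n_samples
    return binned


# Hypothetical toy data: 4 samples (rows) x 2 genes (columns).
toy_expression = [
    [1.0, 500.0],
    [2.0, 100.0],
    [3.0, 300.0],
    [4.0, 200.0],
]
toy_binned = bin_by_gene(toy_expression, n_bins=2)
```

Under this reading, a gene measured in the hundreds and a gene measured in single digits both contribute the same spread of bin indices, which is the property the name suggests.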

What problem does this paper attempt to address?

The problems this paper attempts to address include:

1. **Understanding the complexity of gene networks**: Although existing studies have used advanced AI techniques to model gene expression data, the scope and nature of the biological information that these models can learn are not yet fully understood.
2. **Exploring methods to improve model performance**: Most studies train models on single-cell RNA sequencing data, while the potential of bulk RNA sequencing data has not been fully explored. The effectiveness of different model architectures and methods has also not been adequately evaluated.
3. **Developing multifaceted gene representations**: The research aims to extract rich biological information from large amounts of gene expression data through deep learning, and from it to build a multifaceted gene vector representation (GeneRAIN-vec).
4. **Predicting biological properties of long non-coding RNAs**: The developed model is used to predict 4,797 biological properties for each of 13,030 long non-coding RNAs (lncRNAs), for a total of 62.5 million predictions.

Specifically, the paper addresses these issues in the following ways:

- **Dataset selection**: A dataset of 410,850 human bulk RNA sequencing samples was used, instead of the commonly used single-cell RNA sequencing data.
- **Model architecture**: Several model architectures were tried, including BERT and GPT models, and a new "Binning-By-Gene" normalization method was introduced.
- **Performance evaluation**: The learning ability of the models was evaluated using multiple metrics (such as ARI, FMI, NMI) and compared with existing state-of-the-art models (such as Geneformer and Gene2vec).
- **Application validation**: The model's broad applicability and performance were demonstrated in predicting gene biological properties, simulating gene perturbation responses, and predicting biological properties of long non-coding RNAs.

In summary, through methodological innovation and comprehensive evaluation, this paper improves the understanding of what models trained on gene expression data can learn and how their performance can be raised, providing new tools and perspectives for biological research.
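
The evaluation metrics named above (ARI, FMI, NMI) all score how well a set of cluster assignments agrees with reference labels. The sketch below computes them from a contingency table using only the standard library. It is meant as an illustration of what the metrics measure, not as the paper's evaluation code; the helper name `clustering_agreement` and the toy labels are hypothetical. In practice, scikit-learn offers the same metrics as `adjusted_rand_score`, `fowlkes_mallows_score`, and `normalized_mutual_info_score`.

```python
from collections import Counter
from math import log, sqrt


def _pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def _entropy(counts, n):
    """Shannon entropy (natural log) of a labeling given its cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts.values())


def clustering_agreement(true_labels, pred_labels):
    """Return ARI, FMI, and NMI (arithmetic-mean normalization)."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # contingency table
    rows = Counter(true_labels)
    cols = Counter(pred_labels)

    same_both = sum(_pairs(c) for c in cells.values())
    same_true = sum(_pairs(c) for c in rows.values())
    same_pred = sum(_pairs(c) for c in cols.values())

    # Adjusted Rand index: pair agreement corrected for chance.
    expected = same_true * same_pred / _pairs(n)
    max_index = 0.5 * (same_true + same_pred)
    ari = 1.0 if max_index == expected else (same_both - expected) / (max_index - expected)

    # Fowlkes-Mallows index: geometric mean of pairwise precision and recall.
    denom = sqrt(same_true * same_pred)
    fmi = same_both / denom if denom else 0.0

    # Normalized mutual information.
    mi = sum(
        c / n * log(n * c / (rows[t] * cols[p])) for (t, p), c in cells.items()
    )
    mean_h = 0.5 * (_entropy(rows, n) + _entropy(cols, n))
    nmi = mi / mean_h if mean_h else 1.0

    return {"ARI": ari, "FMI": fmi, "NMI": nmi}


# Hypothetical toy labels: the same grouping with cluster IDs renamed.
reference = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 0, 0, 2, 2]
perfect = clustering_agreement(reference, renamed)

# A grouping that cuts across every reference cluster.
crossed = clustering_agreement(reference, [0, 1, 0, 1, 0, 1])
```

All three metrics ignore the cluster IDs themselves, so the renamed grouping scores as a perfect match, while the crossed grouping scores at or below the chance level.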

Multifaceted Representation of Genes via Deep Learning of Gene Expression Networks

Enhancing Personalized Gene Expression Prediction From DNA Sequences Using Genomic Foundation Models

Genes in Humans and Mice: Insights from Deep learning of 777K Bulk Transcriptomes

Transfer learning enables predictions in network biology

Enhancing Gene Expression Predictions Using Deep Learning and Functional Annotations

Gene Expression Prediction based on Deep Learning

A Generative Adversarial Network Model for Disease Gene Prediction With RNA-seq Data

A deep auto-encoder model for gene expression prediction

Gene-language models are whole genome representation learners

Deep Large-Scale Multitask Learning Network for Gene Expression Inference

scGREAT: Transformer-Based Deep-Language Model for Gene Regulatory Network Inference from Single-Cell Transcriptomics

Deep-learning prediction of gene expression from personal genomes

Predicting the genetic component of gene expression using gene regulatory networks

DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

Modeling gene regulatory networks using neural network architectures

GENet: A Graph-Based Model Leveraging Histone Marks and Transcription Factors for Enhanced Gene Expression Prediction

Effective gene expression prediction from sequence by integrating long-range interactions

DeepIMAGER: Deeply Analyzing Gene Regulatory Networks from scRNA-seq Data

Transformer for Gene Expression Modeling (T-GEM): An Interpretable Deep Learning Model for Gene Expression-Based Phenotype Predictions

A genome-scale deep learning model to predict gene expression changes of genetic perturbations from multiplex biological networks

Biological Factor Regulatory Neural Network