Abstract: Gene-gene interactions play a crucial role in the manifestation of complex human diseases. Uncovering significant gene-gene interactions is a challenging task. Here, we present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth noteworthy gene-gene interactions. Despite the efficacy of Transformer models, their parameter intensity presents a bottleneck in data ingestion, hindering data efficiency. To mitigate this, we introduce a novel weighted diversified sampling algorithm. This algorithm computes the diversity score of each data sample in just two passes of the dataset, facilitating efficient subset generation for interaction discovery. Our extensive experimentation demonstrates that by sampling a mere 1% of the single-cell dataset, we achieve performance comparable to that of utilizing the entire dataset.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses the problem of efficiently discovering important gene-gene interactions (GGI) from single-cell gene expression data. Specifically, the authors propose a data-driven method that uses an advanced Transformer model to reveal significant gene-gene interactions. Because the Transformer has a large number of parameters, however, data ingestion and processing become a bottleneck on large-scale single-cell transcriptome data.

#### Main challenges:

1. **Low data processing efficiency**: Existing Transformer models capture dependencies between gene expressions well, but their large parameter counts make data processing very time-consuming, especially when hardware resources are limited.
2. **Existing models rely on prior knowledge**: Many existing models rely on transcription factors (TF) or known gene-gene interaction (GGI) networks. These methods are prone to high false-positive rates and biases, especially in large-scale in vitro experiments.

#### Solutions:

To address these challenges, the authors propose a new Weighted Diversified Sampling (WDS) algorithm. It computes a diversity score for each data sample in only two passes over the dataset and uses those scores to generate a small subset for gene-gene interaction discovery.

#### Specific methods:

1. **Introduce the CelluFormer model**: A Transformer model designed specifically for single-cell transcriptome data that can learn gene-gene interactions.
2. **Calculate Min-Max density**: Evaluate the diversity of each cell by defining Min-Max similarity and Min-Max density.
3. **Weighted diversified sampling**: Use the inverse Min-Max density as the diversity score for weighted sampling to select a representative subset.

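The sampling steps listed above can be illustrated with a minimal sketch. It rests on several assumptions that the summary does not confirm: Min-Max similarity is taken to be the usual ratio of summed element-wise minima to summed element-wise maxima for non-negative vectors, Min-Max density is taken to be a cell's mean similarity to all cells, and the density is computed by brute force over all pairs rather than with the two-pass estimator the paper describes. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def min_max_similarity(x, y):
    """Assumed Min-Max similarity of two non-negative vectors:
    sum of element-wise minima divided by sum of element-wise maxima."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.maximum(x, y).sum()
    return float(np.minimum(x, y).sum() / denom) if denom > 0 else 0.0


def min_max_density(X):
    """Assumed Min-Max density: mean similarity of each row to every row.

    Brute force, O(n^2 * d). This is NOT the two-pass estimator the paper
    describes; it only illustrates the quantity being estimated.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    density = np.zeros(n)
    for i in range(n):
        mins = np.minimum(X[i], X).sum(axis=1)  # broadcast row i against all rows
        maxs = np.maximum(X[i], X).sum(axis=1)
        sims = np.divide(mins, maxs, out=np.zeros(n), where=maxs > 0)
        density[i] = sims.mean()
    return density


def weighted_diversified_sample(X, k, seed=0):
    """Draw k distinct row indices with probability proportional to the
    inverse density, so cells in sparse regions are favored."""
    rng = np.random.default_rng(seed)
    density = min_max_density(X)
    # Guard against division by zero for all-zero rows (hypothetical choice).
    scores = 1.0 / np.maximum(density, 1e-12)
    probs = scores / scores.sum()
    idx = rng.choice(len(probs), size=k, replace=False, p=probs)
    return idx, scores


# Toy expression matrix: three identical cells and one outlier.
X_toy = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
sample_idx, diversity = weighted_diversified_sample(X_toy, k=2)
```

In this toy matrix the outlier cell receives the largest diversity score, so it is more likely to be kept than any single one of the three duplicated cells. At realistic dataset sizes the quadratic loop is impractical, which is the cost the paper's two-pass scheme is meant to avoid.
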
#### Experimental results:

In extensive experiments, the authors show that sampling only 1% of the single-cell dataset achieves performance comparable to using the entire dataset, which greatly improves data efficiency.

### Summary:

The paper's main contribution is an efficient data-driven method that addresses the data-processing bottleneck in large-scale single-cell transcriptome analysis and discovers gene-gene interactions efficiently through weighted diversified sampling.

Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery
