Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery

Yifan Wu, Yuntao Yang, Zirui Liu, Zhao Li, Khushbu Pahwa, Rongbin Li, Wenjin Zheng, Xia Hu, Zhaozhuo Xu
2024-10-21
Abstract: Gene-gene interactions play a crucial role in the manifestation of complex human diseases. Uncovering significant gene-gene interactions is a challenging task. Here, we present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth noteworthy gene-gene interactions. Despite the efficacy of Transformer models, their parameter intensity presents a bottleneck in data ingestion, hindering data efficiency. To mitigate this, we introduce a novel weighted diversified sampling algorithm. This algorithm computes the diversity score of each data sample in just two passes of the dataset, facilitating efficient subset generation for interaction discovery. Our extensive experimentation demonstrates that by sampling a mere 1% of the single-cell dataset, we achieve performance comparable to that of utilizing the entire dataset.
Artificial Intelligence
### What problem does this paper attempt to solve?

This paper addresses the problem of efficiently discovering important gene-gene interactions (GGIs) from single-cell gene expression data. The authors propose a data-driven method that uses a Transformer model to reveal significant gene-gene interactions. However, because Transformer models have a large number of parameters, they run into bottlenecks in data ingestion and processing efficiency on large-scale single-cell transcriptome data.

#### Main challenges:

1. **Low data processing efficiency**: Although existing Transformer models are good at capturing dependencies between gene expressions, their large parameter counts make data processing very time-consuming, especially when hardware resources are limited.
2. **Existing models rely on prior knowledge**: Many existing models rely on transcription factors (TFs) or known gene-gene interaction (GGI) networks. These methods are prone to high false-positive rates and biases, especially in large-scale in vitro experiments.

#### Solutions:

To address these challenges, the authors propose a new algorithm, Weighted Diversified Sampling (WDS). It computes a diversity score for each data sample and generates a subset in only two passes over the dataset, which makes gene-gene interaction discovery more data-efficient.

#### Specific methods:

1. **Introduce the CelluFormer model**: A Transformer model designed specifically for single-cell transcriptome data that can learn gene-gene interactions.
2. **Calculate Min-Max density**: Evaluate the diversity of each cell by defining a Min-Max similarity and, from it, a Min-Max density.
3. **Weighted diversified sampling**: Use the inverse Min-Max density as the diversity score for weighted sampling, so that a representative subset is selected.
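
The sampling steps above can be illustrated with a small sketch. It is not the authors' implementation, and it rests on several assumptions that the summary does not confirm: Min-Max similarity is taken to be the standard weighted-Jaccard form (sum of element-wise minima divided by sum of element-wise maxima), Min-Max density is assumed to be a cell's mean similarity to all other cells, and the subset is drawn without replacement using Efraimidis-Spirakis keys. The density here is computed by brute force in O(n²), whereas the paper reports a two-pass procedure that this sketch does not reproduce. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def min_max_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Min-Max (weighted Jaccard) similarity of two non-negative vectors."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den > 0 else 1.0  # two all-zero vectors count as identical


def min_max_density(cells: Sequence[Sequence[float]]) -> List[float]:
    """Assumed definition: mean similarity of each cell to every other cell.

    Brute force over all pairs, O(n^2); only meant for small illustrations.
    """
    n = len(cells)
    totals = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = min_max_similarity(cells[i], cells[j])
            totals[i] += s
            totals[j] += s
    return [t / max(n - 1, 1) for t in totals]


def weighted_diversified_sample(
    cells: Sequence[Sequence[float]], k: int, seed: int = 0, eps: float = 1e-12
) -> List[int]:
    """Pick k distinct cell indices, favouring cells with low density."""
    rng = random.Random(seed)
    # Diversity score = inverse density; eps avoids division by zero.
    weights = [1.0 / (d + eps) for d in min_max_density(cells)]
    # Efraimidis-Spirakis keys in log form: log(u) / w with u in (0, 1].
    # Keeping the k largest keys is weighted sampling without replacement.
    keys = [(math.log(1.0 - rng.random()) / w, i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])


# Toy expression matrix (rows = cells, columns = genes): three near-duplicate
# cells and one cell with a completely different expression profile.
toy_cells = [[5, 0, 1], [5, 0, 1], [4, 0, 1], [0, 6, 0]]
print(min_max_density(toy_cells))
print(weighted_diversified_sample(toy_cells, k=2))
```

In the toy matrix the fourth cell shares no expressed gene with the others, so its density is zero and its diversity score is the largest, which makes it far more likely to be kept than any one of the three near-duplicates.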
#### Experimental results:

In extensive experiments, the authors show that sampling only 1% of the single-cell dataset achieves performance comparable to using the entire dataset, which greatly improves data processing efficiency.

### Summary:

The main contribution of this paper is an efficient data-driven method that removes the data processing bottleneck in large-scale single-cell transcriptome analysis and discovers gene-gene interactions efficiently through the weighted diversified sampling algorithm.