Self-supervised deep learning of gene–gene interactions for improved gene expression recovery

Qingyue Wei, Md Tauhidul Islam, Yuyin Zhou, Lei Xing
DOI: https://doi.org/10.1093/bib/bbae031
IF: 9.5
2024-02-14
Briefings in Bioinformatics
Abstract: Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expressions due to a lack of mechanisms that explicitly consider the underlying biological knowledge of the system. In reality, it has long been recognized that gene–gene interactions may serve as reflective indicators of underlying biology processes, presenting discriminative signatures of the cells. A genomic data analysis framework that is capable of leveraging the underlying gene–gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns of the genomic data through extraction and integration of intricate biological characteristics of the genomic data. Here we tackle the problem in two steps to exploit the gene–gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against the existing imputation methods.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The paper addresses the inaccuracy of gene expression values in single-cell RNA sequencing (scRNA-seq) data. Because of the limitations of existing sequencing technologies, low gene expression values are often missed, resulting in inaccurate gene counts. Existing methods, including advanced deep-learning techniques, struggle to impute gene expression reliably because they lack a mechanism that explicitly considers the underlying biological knowledge of the system. The paper proposes a new self-supervised deep-learning framework that improves the accuracy of gene expression recovery by leveraging gene–gene interactions.

### Main problems

1. **Inaccuracy of gene expression data**: Due to technical limitations, low gene expression values are often missed in scRNA-seq data, resulting in inaccurate gene counts.
2. **Deficiencies of existing methods**: Existing gene expression recovery methods are limited in accuracy and computational efficiency, and they struggle to fully capture the long-range relationships between genes.

### Solutions

The paper proposes a framework named TCER (Transform-and-Conquer Expression Recovery) that tackles these problems in two steps:

1. **Mapping of gene–gene interactions**:
   - Rearrange the genes into a 2D grid so that their spatial configuration reflects their interaction relationships. Specifically, genes with strong interactions are placed closer together in the resulting grid (the GenoMap).
   - Use Gromov-Wasserstein divergence minimization to obtain the optimal projection matrix \(T\) that maps the gene data onto the 2D grid.
2. **Self-supervised deep-learning model**:
   - Design a deep neural network with an encoder-decoder structure, named ER-Net, for recovering gene expression values.
   - Introduce three cascaded Deformable Fusion Attention (DFA) modules in ER-Net to extract local and global gene–gene interaction features.
   - Use a dual-attention mechanism (channel attention and pixel attention) to adaptively weight the important feature information and improve the performance of the network.

### Experimental results

The paper reports extensive experiments on simulated and experimental scRNA-seq datasets, demonstrating the superior performance of TCER in gene expression recovery, cell clustering, and trajectory analysis. Compared with existing methods, TCER shows clear advantages on multiple metrics, most notably the Pearson correlation coefficient, and in UMAP visualizations.

### Formulas

- **Gene–gene interaction intensity matrix \(C\)**:
  \[ C_{ij} = \begin{cases} -\dfrac{(\Omega^{-1})_{ij}}{\sqrt{(\Omega^{-1})_{ii}\,(\Omega^{-1})_{jj}}} & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases} \]
  where \(\Omega\) is the covariance matrix, and \(\Omega_{ij}\) is the covariance of the expression values of the \(i\)-th and \(j\)-th genes across all cells. Because the formula uses the inverse covariance \(\Omega^{-1}\), the off-diagonal entries of \(C\) are the partial correlations between gene pairs.
- **Gromov-Wasserstein divergence**:
  \[ GW(C, \bar{C}, u, v) = \min_T E_{C,\bar{C}}(T) \]
  where
  \[ E_{C,\bar{C}}(T) = \sum_{i,j,k,l} L(C_{ik}, \bar{C}_{jl})\, T_{ij}\, T_{kl} \]
  \[ L(a, b) = \mathrm{KL}(a \,|\, b) = a \log\left( \frac{a}{b} \right) - a + b \]
- **Standard convolution operation**: \[ F_{\text{std}}^{\text{out}}(p_x, p_y) =