Non-negative matrix factorization and deconvolution as dual simplex problem

Denis Kleverov, Ekaterina Aladyeva, Alexey Serdyukov, Maxim N. Artyomov
DOI: https://doi.org/10.1101/2024.04.09.588652
2024-04-12
Abstract: Non-negative matrix factorization (NMF) is one of the most powerful linear algebra tools, which has found application in various areas of data analysis, including computational biology. Despite numerous optimization methods devised for NMF, our comprehension of the inherent topological structure within factorizable matrices remains limited. In this work, we reveal the topological properties of the linear mixture data, which allow for a remarkable reduction in the dimensionality of the NMF problem and reformulation of the NMF problem as an optimization problem with only \(K(K - 1)\) variables, with \(K\) representing the number of pure components, irrespective of the initial data matrix dimensionality. This is achieved by uncovering the dual simplex structure of the data, with complementary simplex structures existing in both the features’ and samples’ spaces and leveraging the Sinkhorn transformation to uncover the relationship between these simplexes. We validate this approach in the context of an unconstrained general mixed images scenario and achieve a significant improvement in decomposition accuracy. Furthermore, we successfully apply the proposed approach in the biological context of bulk RNA-seq gene expression data unmixing and single-cell RNA-seq data clustering.
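The abstract refers to the Sinkhorn transformation without spelling it out. As a rough illustration only, the sketch below applies classical Sinkhorn scaling (alternating row and column normalization) to a small positive matrix built as a product of two non-negative factors. The function name `sinkhorn_scale`, the marginal targets (row sums of 1, column sums of `m / n`), and the toy data are assumptions made for this example; the paper's exact normalization may differ.

```python
import numpy as np

def sinkhorn_scale(V, n_iter=500, tol=1e-10):
    """Scale a strictly positive matrix V as diag(r) @ V @ diag(c).

    Hypothetical helper: targets are row sums of 1 and column sums of m / n,
    chosen so that both sets of marginals add up to the same total.
    """
    V = np.asarray(V, dtype=float)
    m, n = V.shape
    r = np.ones(m)
    c = np.ones(n)
    col_target = m / n
    for _ in range(n_iter):
        r = 1.0 / (V @ c)             # rows of the scaled matrix now sum to 1
        c = col_target / (V.T @ r)    # columns of the scaled matrix now sum to m / n
        S = r[:, None] * V * c[None, :]
        # Columns are exact after the last update, so only the rows need checking.
        if np.abs(S.sum(axis=1) - 1.0).max() < tol:
            break
    return r[:, None] * V * c[None, :], r, c

# Toy linear mixture: 6 features, 4 samples, K = 3 components (made-up data).
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, size=(6, 3))
H = rng.uniform(0.1, 1.0, size=(3, 4))
V = W @ H

S, r, c = sinkhorn_scale(V)
row_sums = S.sum(axis=1)
col_sums = S.sum(axis=0)
rank_S = np.linalg.matrix_rank(S, tol=1e-8)
```

Because the scaling only multiplies rows and columns by positive constants, the scaled matrix keeps the rank of the original mixture, so the low-rank structure that NMF relies on is preserved.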
Bioinformatics
What problem does this paper attempt to address?
This paper addresses several key problems in non-negative matrix factorization (NMF):

1. **Understanding the Intrinsic Topological Structure of Data**: Although various optimization methods have been applied to NMF, the understanding of the topological structure within factorizable matrices remains limited. The paper reveals the topological properties of linearly mixed data, which allow for a significant reduction in the dimension of the NMF problem and reformulate it as an optimization problem containing only \(K(K - 1)\) variables, where \(K\) is the number of pure components. This count is independent of the dimensions of the initial data matrix.
2. **Simplifying the Dimension of the NMF Problem**: By discovering the dual simplex structure in the data, the paper proposes a method that reduces the NMF problem from a high-dimensional space to a low-dimensional space, thereby greatly simplifying the computational task.
3. **Improving the Decomposition Accuracy**: The paper validates this method in general mixed-image scenarios and achieves a significant improvement in decomposition accuracy. The method is also applied to the unmixing of bulk RNA-seq gene expression data and the clustering of single-cell RNA-seq data.

### Specific Problems and Solutions

- **Problem 1: Processing of High-Dimensional Data**
  - **Solution**: By revealing the dual simplex structure in the data, the paper reduces the NMF problem from a high-dimensional space to a low-dimensional space, which reduces the number of optimization variables and makes the computation more efficient.
- **Problem 2: Complexity of the Optimization Problem**
  - **Solution**: The paper proposes a unified optimization framework that simultaneously considers the simplex structures of the feature space and the sample space, and solves it with numerical methods such as gradient descent.
- **Problem 3: Performance in Practical Applications**
  - **Solution**: The paper tests the method in multiple experiments, including image unmixing, single-cell RNA-seq data clustering, and bulk RNA-seq data unmixing. The results show that the method can significantly improve the decomposition accuracy under different noise levels.

### Main Contributions

1. **Theoretical Contribution**: Reveals the topological properties of linearly mixed data and proposes the Dual Simplex Theorem, providing a new theoretical basis for the NMF problem.
2. **Methodological Contribution**: Develops an optimization method based on the dual simplex structure, significantly reducing the number of optimization variables and improving computational efficiency.
3. **Application Contribution**: Demonstrates the method in multiple practical application scenarios, particularly in biomedical data analysis.

Through these contributions, the paper advances the understanding of the NMF problem and provides practical tools for applications.
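The reduction to \(K(K - 1)\) variables described above can be made plausible with a small numerical experiment. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it builds synthetic mixtures whose mixing proportions sum to 1, so every sample is a convex combination of \(K\) pure components. It then checks that the samples occupy a \((K - 1)\)-dimensional affine subspace, in which the \(K\) simplex vertices are described by \(K(K - 1)\) coordinates regardless of how many features or samples there are. All variable names and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_samples, n_features = 3, 50, 200

# Assumed toy model: rows of H are pure components, rows of W are mixing
# proportions that sum to 1, so each row of V is a convex combination of rows of H.
H = rng.uniform(0.0, 1.0, size=(K, n_features))
W = rng.dirichlet(np.ones(K), size=n_samples)
V = W @ H

# The mixtures span a K-dimensional linear subspace ...
rank_V = np.linalg.matrix_rank(V, tol=1e-8)

# ... and, after centering, a (K - 1)-dimensional affine subspace.
center = V.mean(axis=0)
affine_dim = np.linalg.matrix_rank(V - center, tol=1e-8)

# Orthonormal basis of that affine subspace from the SVD of the centered data.
_, _, Vt = np.linalg.svd(V - center, full_matrices=False)
basis = Vt[: K - 1]                      # shape (K - 1, n_features)

coords = (V - center) @ basis.T          # samples in reduced coordinates
vertices = (H - center) @ basis.T        # simplex vertices in reduced coordinates

# Number of free coordinates needed to place the K vertices: K * (K - 1).
n_variables = vertices.size

# In reduced coordinates every sample is still the same convex combination of vertices.
max_error = np.abs(coords - W @ vertices).max()
```

Changing `n_samples` or `n_features` leaves `n_variables` unchanged, which mirrors the claim that the size of the reformulated problem depends only on \(K\).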