Multimodal data integration and cross-modal querying via orchestrated approximate message passing

Sagnik Nandy, Zongming Ma
2024-08-24
Abstract: The need for multimodal data integration arises naturally when multiple complementary sets of features are measured on the same sample. Under a dependent multifactor model, we develop a fully data-driven orchestrated approximate message passing algorithm for integrating information across these feature sets to achieve statistically optimal signal recovery. In practice, these reference data sets are often queried later by new subjects that are only partially observed. Leveraging the asymptotic normality of the estimates generated by our data integration method, we further develop an asymptotically valid prediction set for the latent representation of any such query subject. We demonstrate the prowess of both the data integration and the prediction set construction algorithms on a tri-modal single-cell dataset.
Methodology, Statistics Theory
What problem does this paper attempt to address?
The paper addresses two linked problems: integrating multimodal datasets and querying the integrated reference across modalities. For the first, the authors propose a fully data-driven orchestrated approximate message passing algorithm (OrchAMP) that combines information from several feature sets measured on the same samples, as in single-cell multi-omics studies, and achieves statistically optimal signal recovery under a dependent multifactor model. For the second, they consider new query samples for which only some of the feature sets are observed. Using the reference dataset produced by OrchAMP, the method predicts the latent factor representation of each query sample and quantifies the uncertainty of that prediction through an asymptotically valid prediction set, which follows from the asymptotic normality of the OrchAMP estimates. The authors validate both the data integration and the prediction set construction on a tri-modal single-cell dataset. Compared with existing methods such as Weighted Nearest Neighbors (WNN), CiteFuse, MOFA+, and totalVI, the method is reported to be computationally faster and to come with statistical optimality guarantees, and it additionally supplies prediction sets for the query results, which those methods do not provide.
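To make the idea of a prediction set built from asymptotic normality concrete, here is a minimal sketch. It is not the OrchAMP construction from the paper. It only illustrates the generic step of turning an approximately Gaussian estimate into an ellipsoidal set. The function name `in_gaussian_prediction_set`, the two-dimensional latent space, and the assumption that a covariance matrix for the estimate is already available are all hypothetical choices made for illustration.

```python
import math

import numpy as np


def in_gaussian_prediction_set(candidate, estimate, cov, level=0.95):
    """Return True if `candidate` lies in the Gaussian ellipsoid around `estimate`.

    Illustrative assumption: the estimation error is approximately N(0, cov)
    in a 2-dimensional latent space. In two dimensions the chi-square
    quantile has the closed form -2 * log(1 - level).
    """
    candidate = np.asarray(candidate, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    cov = np.asarray(cov, dtype=float)
    if candidate.shape != (2,) or estimate.shape != (2,) or cov.shape != (2, 2):
        raise ValueError("this sketch only handles a 2-dimensional latent space")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")

    diff = candidate - estimate
    # Squared Mahalanobis distance of the candidate from the estimate.
    maha_sq = float(diff @ np.linalg.solve(cov, diff))
    # Chi-square quantile with 2 degrees of freedom.
    threshold = -2.0 * math.log(1.0 - level)
    return maha_sq <= threshold


if __name__ == "__main__":
    est = [0.0, 0.0]
    cov = np.eye(2)
    print(in_gaussian_prediction_set([1.0, 1.0], est, cov))
    print(in_gaussian_prediction_set([3.0, 0.0], est, cov))
```

The set of all candidates for which the function returns True is an ellipse centered at the estimate. Its coverage is only as good as the normal approximation and the covariance supplied, which in the paper are obtained from the data integration step rather than assumed.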