CelLink: integrating single-cell multi-omics data with weak feature linkage and imbalanced cell populations
Xin Luo, Yuanhao Huang, Yicheng Tao, Fan Feng, Alexander Hopkirk, Thomas S.R. Bate, Diane C. Saunders, Peter Orchard, Catherine Robertson, Shristi Shrestha, Jean-Philippe Cartailler, Stephen Parker, Marcela Brissova, Jie Liu
DOI: https://doi.org/10.1101/2024.11.08.622745
2024-11-22
Abstract: Single-cell multi-omics technologies capture complementary molecular layers, enabling a comprehensive view of cellular states and functions. However, integrating these data types is challenging when their features are only weakly linked and their cell population sizes are imbalanced. No existing method efficiently addresses both issues at once. We therefore developed CelLink, a single-cell multi-omics data integration method designed to overcome these challenges. CelLink normalizes and smooths feature profiles to align scales across datasets, then integrates them through a multi-phase pipeline that iteratively applies the optimal transport algorithm. It dynamically refines cell-cell correspondences, identifying and excluding cells that cannot be reliably matched, and so avoids the performance degradation caused by erroneous imputations. This approach adapts to both weak feature linkage and imbalanced cell populations between datasets. Benchmarks on scRNA-seq and spatial proteomics datasets, as well as on paired CITE-seq data, show that CelLink performs better across several evaluation metrics, including data mixing, cell manifold structure preservation, and feature imputation accuracy. CelLink significantly outperforms state-of-the-art methods when cell populations are imbalanced and consistently performs better on balanced datasets. Moreover, by imputing transcriptomic profiles for spatial proteomics data, CelLink uniquely enables cell subtype annotation, correction of mislabelled cells, and spatial transcriptomic analyses. CelLink sets a new milestone for multi-omics data integration. Its ability to impute paired single-cell multi-omics profiles positions it as a pivotal tool for building single-cell multi-modal foundation models and advancing spatial cellular biology.
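The iterative matching idea described in the abstract can be illustrated with a small sketch. This is not CelLink's implementation: it assumes both datasets have already been normalized onto a shared set of linked features, it uses a plain entropic (Sinkhorn) solver with uniform cell weights, and the names `sinkhorn_plan`, `iterative_ot_match`, and the threshold `max_cost` are hypothetical. Only source cells are filtered here, whereas a full method may exclude cells from either dataset.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iter=200):
    """Entropic optimal transport plan between uniformly weighted cells."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on source cells
    b = np.full(m, 1.0 / m)  # uniform mass on target cells
    # Scale the regularization by the largest cost so the kernel stays well conditioned.
    kernel = np.exp(-cost / (reg * max(cost.max(), 1e-12)))
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def squared_distances(x, y):
    """Pairwise squared Euclidean distances between rows of x and rows of y."""
    diff = x[:, None, :] - y[None, :, :]
    return (diff ** 2).sum(axis=2)

def iterative_ot_match(x, y, max_cost, n_rounds=5):
    """Re-solve OT after dropping source cells that match poorly.

    x, y: (cells, linked features) arrays on a comparable scale (assumed).
    max_cost: hypothetical cutoff on a cell's expected transport cost.
    Returns the indices of retained source cells and their transport plan.
    """
    keep = np.arange(x.shape[0])
    for _ in range(n_rounds):
        cost = squared_distances(x[keep], y)
        plan = sinkhorn_plan(cost)
        # Average cost of each source cell under its own row of the plan.
        expected_cost = (plan * cost).sum(axis=1) / plan.sum(axis=1)
        ok = expected_cost <= max_cost
        if ok.all():
            return keep, plan
        if not ok.any():
            return keep[:0], np.zeros((0, y.shape[0]))
        keep = keep[ok]
    # Rounds exhausted after a drop: recompute so the plan matches `keep`.
    return keep, sinkhorn_plan(squared_distances(x[keep], y))

# Toy example: the source has a population (last 10 cells) absent from the target.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 0.3, size=(50, 5))
x = np.vstack([rng.normal(0.0, 0.3, size=(40, 5)),
               rng.normal(5.0, 0.3, size=(10, 5))])
keep, plan = iterative_ot_match(x, y, max_cost=10.0)
```

In this toy setting the ten source cells with no counterpart in the target have a high expected transport cost and are removed in the first round, so the second round matches only the shared population. A fixed cost cutoff is the simplest possible exclusion rule and is used here purely for illustration.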
Biology