Dynamic programming algorithms for fast and accurate cell lineage tree reconstruction from CRISPR-based lineage tracing data
Junyan Dai, Erin Molloy
DOI: https://doi.org/10.1101/2024.11.15.623872
2024-11-16
Abstract: CRISPR-based lineage tracing, coupled with single-cell RNA sequencing, has emerged as a promising approach for studying cell transformations during development as well as disease progression. However, the high ratio of cells to CRISPR-induced mutations, combined with missing data from silencing or dropout, makes cell lineage tree (CLT) reconstruction difficult. As a result, this computational problem has attracted significant attention in recent years, including the introduction of Star Homoplasy Parsimony (SHP) in 2023 to model the specific properties of CRISPR-induced mutations, along with the Startle family of methods based on integer linear programming (ILP) or heuristic search (NNI). Here, we present Star-CDP, the first dynamic programming algorithm for SHP. Star-CDP solves SHP within a constrained search space $\Sigma$ defined by subsets of cells from which a solution CLT must draw its clades. When $\Sigma$ is the power set, Star-CDP is an exact exponential algorithm with time complexity $O(nm|\Sigma|^2)$, where $n$ is the number of cells, $m$ is the number of target sites, and $|\Sigma| = O(2^n)$. We show that it is possible to build clade constraints that are polynomially sized and effective in practice. Motivated by the technological challenges in producing consistent phylogenetic signal across the tree during lineage tracing, we also present algorithms to efficiently count, sample, and build consensus trees from all solutions to the clade-constrained SHP problem. In simulations, Star-CDP's strict consensus effectively reduced false positive branches while preserving many more true positives than the standard strict consensus implemented by PAUP*, a popular parsimony method from species phylogenetics.
Likewise, Star-CDP's strict consensus achieved the same or higher accuracy (F1-score) on all but one of the 15 model conditions tested, often outperforming the leading methods, Startle-ILP and Startle-NNI, while also scaling to larger data sets than Startle-ILP. Lastly, we analyzed lineage tracing data from the KP-Tracer mouse model of lung adenocarcinoma, finding that Star-CDP produced plausible CLTs, often lowering the number of migration and reseeding events needed to explain metastases compared to Startle. Our analysis also showed, for the first time, that strategies for preprocessing cells with missing data (specifically cell pruning and deduplication techniques) can have a substantial impact on CLTs reconstructed with the same method, even changing relative performance across methods compared to previously published results. The same was true of postprocessing trees with LAML, a maximum likelihood method designed for mixed-type missing data. By exploring these different pipelines, we recovered the most plausible CLT for the largest KP-Tracer metastatic tumor, reducing the number of reseeding events from 42 to 10 without increasing the number of migrations. Star-CDP is available on GitHub: https://github.com/molloy-lab/Star-CDP.
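To illustrate the general shape of a clade-constrained dynamic program, the following is a minimal sketch, not the Star-CDP implementation. It only counts the rooted binary trees whose clades are all drawn from $\Sigma$ and ignores the SHP score entirely. The function names `count_constrained_trees` and `power_set` are hypothetical, and treating singleton clades and the full cell set as always allowed is an assumption of this sketch. The nested loop over pairs of clades is where a quadratic dependence on $|\Sigma|$ would arise.

```python
from itertools import combinations


def power_set(cells):
    """All non-empty subsets of `cells` (the unconstrained search space)."""
    cells = list(cells)
    return [frozenset(c)
            for r in range(1, len(cells) + 1)
            for c in combinations(cells, r)]


def count_constrained_trees(cells, sigma):
    """Count rooted binary trees on `cells` whose clades all lie in `sigma`.

    Assumption of this sketch: singleton clades and the full cell set are
    always allowed, whether or not they appear in `sigma`.
    """
    full = frozenset(cells)
    allowed = {frozenset(s) for s in sigma}
    allowed |= {frozenset([c]) for c in full}
    allowed.add(full)

    count = {}
    # Process clades from smallest to largest so that both children of a
    # clade have already been counted when the clade itself is visited.
    for clade in sorted(allowed, key=len):
        if len(clade) == 1:
            count[clade] = 1
            continue
        anchor = min(clade)  # fixes which child is "left", so each split is seen once
        total = 0
        for left in allowed:
            # `left` must be a proper subset of `clade` containing the anchor.
            if anchor not in left or not left < clade:
                continue
            right = clade - left
            if right in allowed:
                total += count[left] * count[right]
        count[clade] = total
    return count[full]


if __name__ == "__main__":
    cells = range(4)
    print(count_constrained_trees(cells, power_set(cells)))
    print(count_constrained_trees(cells, [{0, 1}, {2, 3}]))
```

With the power set as $\Sigma$, the count equals the number of rooted binary trees on the cells, $(2n-3)!!$; restricting $\Sigma$ shrinks both the search space and the number of trees that can be assembled from it. A scoring variant would replace the sum of products with a minimum over child scores plus a per-clade cost, which this sketch does not attempt.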
Bioinformatics