Dynamic programming algorithms for fast and accurate cell lineage tree reconstruction from CRISPR-based lineage tracing data

Junyan Dai, Erin Molloy
DOI: https://doi.org/10.1101/2024.11.15.623872
2024-11-16
Abstract: CRISPR-based lineage tracing, coupled with single-cell RNA sequencing, has emerged as a promising approach for studying cell transformations during development as well as disease progression. However, the high ratio of cells to CRISPR-induced mutations, combined with missing data from silencing or dropout, makes cell lineage tree (CLT) reconstruction difficult. As a result, this computational problem has attracted significant attention in recent years, including the introduction of Star Homoplasy Parsimony (SHP) in 2023 to model the specific properties of CRISPR-induced mutations, along with the Startle family of methods based on integer linear programming (ILP) or heuristic search (NNI). Here, we present Star-CDP, the first dynamic programming algorithm for SHP. Star-CDP solves SHP within a constrained search space $\Sigma$ defined by subsets of cells from which a solution CLT must draw its clades. When $\Sigma$ is the power set, Star-CDP is an exact exponential algorithm with time complexity $O(nm|\Sigma|^2)$, where $n$ is the number of cells, $m$ is the number of target sites, and $|\Sigma| = O(2^n)$. We show that it is possible to build clade constraints that are polynomially-sized and effective in practice. Motivated by the technological challenges in producing consistent phylogenetic signal across the tree during lineage tracing, we also present algorithms to efficiently count, sample, and build consensus trees from all solutions to the clade-constrained SHP problem. In simulations, Star-CDP's strict consensus effectively reduced false positive branches while preserving many more true positives compared to the standard strict consensus implemented by PAUP*, a popular parsimony method from species phylogenetics. Likewise, Star-CDP's strict consensus achieved the same or higher accuracy (F1-score) on all but one of the 15 model conditions tested, often outperforming the leading methods, Startle-ILP and Startle-NNI, while also scaling to larger data sets than Startle-ILP. Lastly, we analyzed lineage tracing data from the KP-Tracer mouse model of lung adenocarcinoma, finding that Star-CDP produced plausible CLTs, often lowering the number of migration and reseeding events needed to explain metastases compared to Startle. Our analysis also showed, for the first time, that strategies for preprocessing cells with missing data (specifically cell pruning and deduplicating techniques) can have a substantial impact on CLTs reconstructed with the same method, even changing relative performance across methods compared to previously published results. The same was true of postprocessing trees with LAML, a maximum likelihood method designed for mixed-type missing data. By exploring these different pipelines, we recovered the most plausible CLT for the largest KP-Tracer metastatic tumor, reducing the number of reseeding events from 42 to 10 without increasing the number of migrations. Star-CDP is available on GitHub: https://github.com/molloy-lab/Star-CDP.
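To make the idea of a clade-constrained search space concrete, the following minimal Python sketch counts the rooted binary trees whose clades are all drawn from a constraint set. It is an illustration under stated assumptions, not the authors' implementation: the function name `count_constrained_trees` and the representation of clades as frozensets are hypothetical, and the sketch counts every tree in the constrained space rather than only the SHP-optimal ones, since no parsimony scoring is included.

```python
from functools import lru_cache


def count_constrained_trees(cells, sigma):
    """Count rooted binary trees on `cells` whose clades all belong to `sigma`.

    Hypothetical helper for illustration; it performs no SHP scoring.
    """
    root = frozenset(cells)
    # Assumption: leaves (singletons) and the full cell set are always allowed.
    allowed = {frozenset(s) for s in sigma}
    allowed |= {frozenset([c]) for c in root}
    allowed.add(root)

    @lru_cache(maxsize=None)
    def count(clade):
        if len(clade) == 1:
            return 1  # a single cell forms exactly one (trivial) tree
        anchor = min(clade)  # fixes which side is "left" so each split is seen once
        total = 0
        for left in allowed:
            # `left < clade` is the proper-subset test on frozensets
            if left < clade and anchor in left:
                right = clade - left
                if right in allowed:
                    total += count(left) * count(right)
        return total

    return count(root)
```

With the full power set as the constraint set, the count equals the number of rooted binary trees on the cells (3 trees for 3 cells, 15 for 4). Shrinking the constraint set shrinks the count, which is the sense in which clade constraints restrict the search. Each clade is compared against every allowed subset, so the sketch performs on the order of $|\Sigma|^2$ subset checks.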
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the challenges of reconstructing cell lineage trees (CLTs) from CRISPR-based lineage tracing data. Specifically, it focuses on the following issues:

1. **High ratio of cells to CRISPR-induced mutations**: When the number of cells is much larger than the number of CRISPR-induced mutations, it becomes difficult to reconstruct an accurate cell lineage tree.
2. **Missing data**: Gene silencing and dropout in single-cell sequencing cause some data to be lost, which further increases the difficulty of reconstructing cell lineage trees.
3. **Limitations of existing methods**: Existing methods such as Startle-ILP and Startle-NNI perform well in some respects, but they suffer from problems such as high computational complexity, no guarantee of an optimal solution, and sensitivity to low information content and missing data.

To meet these challenges, the paper proposes a new dynamic programming algorithm, Star-CDP. Its main contributions include:

- **A constrained search space**: A set $\Sigma$ of cell subsets restricts the clades, and therefore the topologies, that a candidate cell lineage tree may contain, thereby reducing the search space.
- **An efficient dynamic programming algorithm**: Star-CDP solves the clade-constrained large Star Homoplasy Parsimony (CC-LSHP) problem in time polynomial in the size of $\Sigma$, with a time complexity of $O(nm|\Sigma|^{1.726} + n|\Sigma|^2)$, where $n$ is the number of cells, $m$ is the number of target sites, and $|\Sigma|$ is the size of the set $\Sigma$.
- **Handling of low information content and missing data**: The paper proposes algorithms to efficiently count, sample, and build consensus trees from all solutions, addressing the challenges posed by low information content and missing data.
Through these methods, Star-CDP performs well in simulation experiments: its strict consensus reduces false-positive branches while retaining more true-positive branches than the standard strict consensus implemented by PAUP*, and it often achieves higher accuracy than Startle-ILP and Startle-NNI. Star-CDP also performs well on lung cancer metastasis data from the KP-Tracer mouse model, where it reduces the number of reseeding events without increasing the number of migration events. In conclusion, by introducing a new dynamic programming algorithm together with methods for handling low information content and missing data, this paper improves the accuracy and efficiency of reconstructing cell lineage trees from CRISPR-based lineage tracing data.
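The trade-off between false-positive and true-positive branches can be illustrated with a short Python sketch. It assumes each tree is represented as a set of clades (frozensets of cell labels), takes the strict consensus as the clades shared by every tree, and scores an estimate against a reference with the standard F1 formula. The names `strict_consensus` and `clade_f1` are hypothetical, and the paper's exact branch-comparison protocol may differ (for example, in how trivial clades are treated).

```python
def strict_consensus(trees):
    """Return the clades present in every tree; each tree is an iterable of clades."""
    clade_sets = [{frozenset(clade) for clade in tree} for tree in trees]
    if not clade_sets:
        return set()
    return set.intersection(*clade_sets)


def clade_f1(estimated, truth):
    """Standard F1-score between two clade sets (assumed scoring, for illustration)."""
    estimated = {frozenset(c) for c in estimated}
    truth = {frozenset(c) for c in truth}
    tp = len(estimated & truth)   # clades recovered correctly
    fp = len(estimated - truth)   # clades in the estimate but not in the truth
    fn = len(truth - estimated)   # true clades that the estimate misses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two equally plausible trees that disagree on one clade.
tree_a = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}]
tree_b = [{"a", "b"}, {"a", "b", "d"}, {"a", "b", "c", "d"}]
consensus = strict_consensus([tree_a, tree_b])
```

Here the consensus keeps only the two clades on which both trees agree. Scored against `tree_a` as the reference, it has no false positives but misses one true clade, whereas `tree_b` alone would contribute one false positive.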