Dynamic programming algorithms for fast and accurate cell lineage tree reconstruction from CRISPR-based lineage tracing data

Junyan Dai, Erin Molloy

DOI: https://doi.org/10.1101/2024.11.15.623872

2024-11-16

Abstract: CRISPR-based lineage tracing, coupled with single-cell RNA sequencing, has emerged as a promising approach for studying cell transformations during development as well as disease progression. However, the high ratio of cells to CRISPR-induced mutations, combined with missing data from silencing or dropout, makes cell lineage tree (CLT) reconstruction difficult. As a result, this computational problem has attracted significant attention in recent years, including the introduction of Star Homoplasy Parsimony (SHP) in 2023 to model the specific properties of CRISPR-induced mutations, along with the Startle family of methods based on integer linear programming (ILP) or heuristic search (NNI). Here, we present Star-CDP, the first dynamic programming algorithm for SHP. Star-CDP solves SHP within a constrained search space $\Sigma$ defined by subsets of cells from which a solution CLT must draw its clades. When $\Sigma$ is the power set, Star-CDP is an exact exponential algorithm with time complexity $O(nm|\Sigma|^2)$, where $n$ is the number of cells, $m$ is the number of target sites, and $|\Sigma| = O(2^n)$. We show that it is possible to build clade constraints that are polynomially-sized and effective in practice. Motivated by the technological challenges in producing consistent phylogenetic signal across the tree during lineage tracing, we also present algorithms to efficiently count, sample, and build consensus trees from all solutions to the clade-constrained SHP problem. In simulations, Star-CDP's strict consensus effectively reduced false positive branches while preserving many more true positives compared to the standard strict consensus implemented by PAUP*, a popular parsimony method from species phylogenetics. Likewise, Star-CDP's strict consensus achieved the same or higher accuracy (F1-score) on all but one of the 15 model conditions tested, often outperforming the leading methods, Startle-ILP and Startle-NNI, while also scaling to larger data sets than Startle-ILP. Lastly, we analyzed lineage tracing data from the KP-Tracer mouse model of lung adenocarcinoma, finding that Star-CDP produced plausible CLTs, often lowering the number of migration and reseeding events needed to explain metastases compared to Startle. Our analysis also showed, for the first time, that strategies for preprocessing cells with missing data (specifically cell pruning and deduplication techniques) can have a substantial impact on CLTs reconstructed with the same method, even changing relative performance across methods compared to previously published results. The same was true of postprocessing trees with LAML, a maximum likelihood method designed for mixed-type missing data. By exploring these different pipelines, we recovered the most plausible CLT for the largest KP-Tracer metastatic tumor, reducing the number of reseeding events from 42 to 10 without increasing the number of migrations. Star-CDP is available on GitHub: https://github.com/molloy-lab/Star-CDP.
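
The abstract evaluates trees by their strict consensus and by an F1-score over branches. As a minimal sketch of those two ideas, assuming each tree is represented simply as a set of clades (each clade a `frozenset` of cell names), the strict consensus keeps only the clades shared by every tree, and F1 is the harmonic mean of branch precision and recall. The function names and the clade-set representation are illustrative assumptions, not taken from the Star-CDP code, and the paper's exact definition of true and false positive branches may differ.

```python
from typing import FrozenSet, Iterable, Set

Clade = FrozenSet[str]


def strict_consensus(trees: Iterable[Set[Clade]]) -> Set[Clade]:
    """Return the clades that appear in every input tree."""
    tree_list = [set(t) for t in trees]
    if not tree_list:
        return set()
    return set.intersection(*tree_list)


def f1_score(estimated: Set[Clade], true: Set[Clade]) -> float:
    """Harmonic mean of precision and recall over clades."""
    true_positives = len(estimated & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical cells a, b, c, d): two equally good trees
# that agree on clade {a, b} and disagree on everything else.
tree_1 = {frozenset("ab"), frozenset("abc")}
tree_2 = {frozenset("ab"), frozenset("abd")}
consensus = strict_consensus([tree_1, tree_2])  # only {a, b} survives
score = f1_score(consensus, tree_1)             # precision 1, recall 1/2
```

In this toy case the consensus drops the two conflicting clades, which removes a potential false positive at the price of lower recall; that trade-off is what the F1-score summarizes.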

Bioinformatics

What problem does this paper attempt to address?

This paper addresses the challenges of constructing cell lineage trees (CLTs) from CRISPR-based lineage tracing data. Specifically, it focuses on the following issues:

1. **High ratio of cells to CRISPR-induced mutations**: When the number of cells is much larger than the number of CRISPR-induced mutations, it becomes difficult to construct an accurate cell lineage tree.
2. **Missing data**: Gene silencing and dropout in single-cell sequencing cause some data to be lost, which further increases the difficulty of constructing cell lineage trees.
3. **Limitations of existing methods**: Existing methods such as Startle-ILP and Startle-NNI perform well in some respects, but they have drawbacks such as high computational complexity, no guarantee of an optimal solution, and sensitivity to low information content and missing data.

To meet these challenges, the paper proposes a new dynamic programming algorithm, Star-CDP. Its main contributions include:

- **Introduction of a constrained search space**: A set Σ of cell subsets restricts the clades, and therefore the topologies, that a candidate cell lineage tree may contain, which reduces the search space.
- **Efficient dynamic programming algorithm**: Star-CDP solves the clade-constrained star homoplasy parsimony problem (CC-LSHP) in time polynomial in n, m, and |Σ|, with a time complexity of O(nm|Σ|^1.726 + n|Σ|^2), where n is the number of cells, m is the number of target sites, and |Σ| is the size of the set Σ.
- **Handling low information content and missing data**: The paper presents algorithms to efficiently count, sample, and build consensus trees from all solutions, addressing the challenges posed by low information content and missing data.

With these methods, Star-CDP performs well in simulation experiments: its strict consensus reduces false-positive branches while retaining more true-positive branches than the strict consensus from PAUP*, and its accuracy often exceeds that of Startle-ILP and Startle-NNI. Star-CDP also performs well on lung adenocarcinoma metastasis data from the KP-Tracer mouse model, reducing the number of reseeding events without increasing the number of migration events. In conclusion, by introducing a new dynamic programming algorithm together with methods for handling low information content and missing data, this paper improves the accuracy and efficiency of constructing cell lineage trees from CRISPR-based lineage tracing data.
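
The general shape of a dynamic program over a constrained clade set Σ, including the counting of optimal solutions, can be sketched as follows. This is a generic illustration, not Star-CDP itself: it assumes binary trees and a hypothetical additive `split_cost(left, right)` supplied by the caller, whereas the actual star homoplasy scoring is described in the paper and not reproduced here. All names are assumptions made for the example.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Clade = FrozenSet[str]


def constrained_dp(
    leaves: Iterable[str],
    sigma: Iterable[Iterable[str]],
    split_cost: Callable[[Clade, Clade], int],
) -> Tuple[float, int]:
    """Return (optimal cost, number of optimal binary trees) using only clades in sigma."""
    leaf_list = list(leaves)
    root = frozenset(leaf_list)
    # Singletons and the full leaf set are always allowed as clades.
    clades = {frozenset(c) for c in sigma}
    clades |= {frozenset([x]) for x in leaf_list}
    clades.add(root)

    best: Dict[Clade, float] = {}
    count: Dict[Clade, int] = {}
    # Process clades from small to large so sub-clades are solved first.
    for clade in sorted(clades, key=len):
        if len(clade) == 1:
            best[clade], count[clade] = 0, 1
            continue
        best[clade], count[clade] = float("inf"), 0
        anchor = min(clade)  # fix one leaf on the left to visit each split once
        for left in clades:
            if anchor not in left or not left < clade:
                continue
            right = clade - left
            if right not in clades or count[left] == 0 or count[right] == 0:
                continue
            cost = best[left] + best[right] + split_cost(left, right)
            ways = count[left] * count[right]
            if cost < best[clade]:
                best[clade], count[clade] = cost, ways
            elif cost == best[clade]:
                count[clade] += ways
    return best[root], count[root]


# Toy example with a constant cost per split, so every tree that can be
# drawn from sigma is optimal and the count equals the number of such trees.
toy_sigma = [{"a", "b"}, {"c", "d"}, {"a", "b", "c"}]
toy_result = constrained_dp("abcd", toy_sigma, lambda left, right: 1)
```

The inner loop over pairs of clades is where a quadratic dependence on |Σ| comes from in this sketch. When Σ is the power set, the same recursion enumerates every binary tree, which is why the unconstrained case is exponential.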

Startle: A star homoplasy approach for CRISPR-Cas9 lineage tracing

Maximum Likelihood Inference of Time-scaled Cell Lineage Trees with Mixed-type Missing Data

Scart: Recognizing Cell Clusters and Constructing Trajectory from Single-Cell Epigenomic Data

Tree reconstruction guarantees from CRISPR-Cas9 lineage tracing data using Neighbor-Joining

Sciphy: A Bayesian phylogenetic framework using sequential genetic lineage tracing data.

Estimation of cell lineage trees by maximum-likelihood phylogenetics

Analysis of Cell Lineage Trees by Exact Bayesian Inference Identifies Negative Autoregulation of Nanog in Mouse Embryonic Stem Cells.

Unveiling Clonal Cell Fate and Differentiation Dynamics: A Hybrid NeuralODE-Gillespie Approach

Abstract 2326: Integrating single nucleotide variants (SNVs), copy number alterations (CNAs), and structural variants (SVs) into single-cell clonal lineage inference

LinRace: cell division history reconstruction of single cells using paired lineage barcode and gene expression data

ScisTree2: An Improved Method for Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single Cell Data

Single-cell lineage tracing by integrating CRISPR-Cas9 mutations with transcriptomic data

Pharming: Joint Clonal Tree Reconstruction of SNV and CNA Evolution from Single-cell DNA Sequencing of Tumors

Single-cell phylodynamic inference of tissue development and tumor evolution with scPhyloX

A Comprehensive Evaluation of CRISPR Lineage Recorders Using TraceQC

Simulation of CRISPR-Cas9 editing on evolving barcode and accuracy of lineage tracing

Charting Single Cell Lineage Dynamics and Mutation Networks via Homing CRISPR

scTrace+: enhance the cell fate inference by integrating the lineage-tracing and multi-faceted transcriptomic similarity information

Cellular proliferation biases clonal lineage tracing and trajectory inference

Abstract 5301: SubHap: an Efficient Algorithm for Reconstructing Clonal Haplotypes of Tumor Sample from NGS Data