Abstract 861: Improvements in variant calling sensitivity and specificity in single-cell DNA sequencing using deep learning

Manimozhi Manivannan, Sombeet Sahu, Kim Dong, Shu Wang, Saurabh Gulati, Saurabh Parikh, Nigel Beard, Anup Parikh
DOI: https://doi.org/10.1158/1538-7445.am2020-861
2020-08-13
Abstract: Background: Advances in single-cell sequencing technologies now make it possible to interrogate thousands of cells in a single experiment to study genetic variability. Single-cell DNA platforms such as Tapestri are still susceptible to errors from polymerase misincorporation, structure-induced template switching, PCR-mediated recombination in the Tapestri workflow, and DNA damage. Sequencing errors can also propagate from cluster amplification, cycle sequencing, or image analysis. Altogether, these errors can be divided into substitutions, insertions, and deletions, and their rates range from 0.5% to 2% depending on the sequencer. This makes rare-variant and minimal residual disease detection challenging. To address these challenges, we developed deep learning models that correct these errors, reduce false-positive rates, and predict true variants. Method: First, we build a consensus sequence from several reads to predict the correct sequence. The initial layers of the network learn motifs and local sequence contexts to classify the patterns. The output of this network is a probability distribution over the possible bases, and the prediction is the base with the highest probability. The bases in the reads are then corrected to the base predicted by this first-step model. After error-correcting the reads, we feed the variants called by the Genome Analysis Toolkit into a multi-class classifier network. Our features consist of the percentage of cells mutated and genotype features including depth, allele frequency (AF), and quality of each variant in these cells. The truth labels are generated on the Tapestri instrument from multiple experiments with known bulk truth. We trained the network on over 200k cells from 13 samples and tested it on a larger set of samples. Class imbalance was handled by upsampling the truth data.
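The two mechanical steps named in the Method, choosing the highest-probability base at each position and upsampling to counter class imbalance, can be illustrated with a minimal sketch. This is not the authors' model: the per-position probabilities here are plain base frequencies across aligned reads, standing in for the network's output, and the upsampling is simple random oversampling of every class up to the size of the largest one. The function names `consensus` and `upsample`, the equal-length read assumption, and the toy labels are all hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def consensus(reads):
    """Return the most probable base at each position of equal-length reads.

    Assumption: the probability of a base is its frequency in the column,
    a simple stand-in for the probability distribution a trained network
    would output. Columns with no A/C/G/T call are reported as 'N'.
    """
    out = []
    for column in zip(*reads):
        counts = Counter(b for b in column if b in BASES)
        total = sum(counts.values())
        if total == 0:
            out.append("N")
            continue
        probs = {b: counts[b] / total for b in BASES}
        # The prediction is the base with the highest probability.
        out.append(max(BASES, key=lambda b: probs[b]))
    return "".join(out)

def correct_reads(reads):
    """Replace every read with the consensus (the error-correction step)."""
    cons = consensus(reads)
    return [cons for _ in reads]

def upsample(rows, labels, seed=0):
    """Randomly oversample, with replacement, each class to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

For example, `consensus(["ACGT", "ACGA", "ACGT"])` resolves the disagreement in the last column in favor of the majority base, and `upsample` applied to three false-positive rows and one true-variant row returns three rows of each class.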
Our training samples are diverse: cell mixtures at various dilutions down to 0.1% and clinical samples, processed on the Tapestri instrument and sequenced on a range of sequencers including MiSeq and NovaSeq. Conclusion: To validate this method, we used two different targeted panels on a Latin square model system with known truth mutations. With our two-step workflow of error correction followed by variant prediction, we significantly improved our median positive predictive value (PPV) 2-3 fold at a 0.5% limit of detection (LOD) while maintaining sensitivity. We are further optimizing the model by adding more training samples and refining the features. Citation Format: Manimozhi Manivannan, Sombeet Sahu, Kim Dong, Shu Wang, Saurabh Gulati, Saurabh Parikh, Nigel Beard, Anup Parikh. Improvements in variant calling sensitivity and specificity in single-cell DNA sequencing using deep learning [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 861.
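For readers unfamiliar with the two metrics reported in the Conclusion, the sketch below shows how PPV and sensitivity are conventionally computed from a call set and a truth set: PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN). The function name and the variant identifiers are hypothetical placeholders, not data from the study.

```python
def ppv_and_sensitivity(called, truth):
    """Compute PPV and sensitivity of a set of variant calls against known truth.

    called, truth: iterables of hashable variant identifiers.
    Returns (ppv, sensitivity); a metric is NaN when its denominator is zero.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and true
    fp = len(called - truth)   # called but not true
    fn = len(truth - called)   # true but missed
    ppv = tp / (tp + fp) if called else float("nan")
    sensitivity = tp / (tp + fn) if truth else float("nan")
    return ppv, sensitivity

# Hypothetical toy call set and truth set.
calls = ["chr1:100:A>T", "chr2:200:G>C", "chr3:300:C>A"]
known = ["chr1:100:A>T", "chr2:200:G>C", "chr4:400:T>G", "chr5:500:G>A"]
ppv, sens = ppv_and_sensitivity(calls, known)
```

In this toy case two of the three calls are true and two of the four known variants are recovered, so removing the one false positive would raise PPV without changing sensitivity, which is the kind of trade-off the abstract reports.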