Abstract

Background: Advances in single-cell sequencing technologies now make it possible to interrogate thousands of cells in a single experiment to study genetic variability. Single-cell DNA platforms such as Tapestri are still susceptible to errors from polymerase misincorporation, structure-induced template switching, PCR-mediated recombination in the Tapestri workflow, and DNA damage. Sequencing errors can also arise during cluster amplification, cycle sequencing, or image analysis. Together, these errors can be classified as substitutions, insertions, and deletions, and their rate ranges from 0.5% to 2% depending on the sequencer. This makes rare variant and minimal residual disease detection challenging. To address these challenges, we developed deep learning models that correct these errors, reduce false-positive rates, and predict true variants.

Method: First, we build a consensus sequence from several reads to predict the correct sequence. The initial layers of the network learn motifs and local sequence contexts for classifying the patterns. The output of this network is a probability distribution over the possible bases, and the prediction is the base with the highest probability. The bases in the reads are then corrected to the base predicted by this first-step model. After error-correcting the reads, we feed the variants called by the Genome Analysis Toolkit into a multi-class classifier network. Our features consist of the percentage of cells mutated and genotype features including depth, allele frequency (AF), and quality of each variant in these cells. The truth labels are generated using the Tapestri instrument from multiple experiments with known bulk truth. We trained the network on over 200k cells from 13 samples and tested it on a larger set of samples. Class imbalance was handled by upsampling the truth data.
Our training samples include diverse samples from cell mixtures at various dilutions down to 0.1%, as well as clinical samples processed through the Tapestri instrument and sequenced on a diverse set of sequencers, including MiSeq and NovaSeq.

Conclusion: To validate this method, we used two different targeted panels on a Latin square model system with known truth mutations. With our two-step workflow combining the error correction and variant prediction models, we significantly improved our median positive predictive value (PPV) 2- to 3-fold at a 0.5% limit of detection (LOD) while maintaining sensitivity. We are further optimizing the model by adding more training samples and refining the features.

Citation Format: Manimozhi Manivannan, Sombeet Sahu, Kim Dong, Shu Wang, Saurabh Gulati, Saurabh Parikh, Nigel Beard, Anup Parikh. Improvements in variant calling sensitivity and specificity in single-cell DNA sequencing using deep learning [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 861.
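The first step described in the Method (a probability distribution over bases at each position, a prediction of the most probable base, and correction of the reads to that base) can be illustrated with a minimal sketch. This is not the authors' network: the distribution here comes from simple per-position base frequencies across aligned reads, a stand-in for the learned model. The function names `base_distribution`, `consensus`, and `correct_reads` are hypothetical, and the reads are assumed to be pre-aligned and of equal length.

```python
from collections import Counter

BASES = "ACGT"


def base_distribution(column):
    """Return a probability distribution over BASES for one alignment column.

    Assumption: frequencies stand in for the probabilities that the
    abstract's network would output.
    """
    counts = Counter(b for b in column if b in BASES)
    total = sum(counts.values())
    if total == 0:
        # No informative bases (e.g. all 'N'): fall back to uniform.
        return {b: 1.0 / len(BASES) for b in BASES}
    return {b: counts[b] / total for b in BASES}


def consensus(reads):
    """Pick, at each position, the base with the highest probability."""
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    called = []
    for column in zip(*reads):
        dist = base_distribution(column)
        # Ties resolve to the first base in BASES order.
        called.append(max(BASES, key=lambda b: dist[b]))
    return "".join(called)


def correct_reads(reads):
    """Correct each read to the consensus; report how many bases changed."""
    cons = consensus(reads)
    corrected = []
    for read in reads:
        n_changed = sum(1 for a, b in zip(read, cons) if a != b)
        corrected.append((cons, n_changed))
    return corrected


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"]
    print(consensus(reads))
    for read, (fixed, n) in zip(reads, correct_reads(reads)):
        print(read, "->", fixed, f"({n} base(s) corrected)")
```

A frequency vote like this treats every position independently, whereas the abstract states that the network's initial layers learn motifs and local sequence context, so the sketch only conveys the predict-then-correct flow.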
