Abstract: e13002
Background: Next-generation sequencing (NGS) technologies have advanced cancer genomic profiling and improved cancer treatment. Many NGS data analysis tools are available for identifying different genomic alterations, including short insertions and deletions (short indels, generally < 25 bp). However, detecting large indels (L-indels) of > 100 bp from short reads (generally < 200 bp) remains a major challenge. L-indels in genes such as MET and FLT3 have proven critical implications for cancer treatment. Moreover, there is an urgent need for an algorithm to validate L-indels generated by genome modification systems such as ZFNs, TALENs, and CRISPR/Cas9.
Methods: A novel algorithm was developed for calling L-indels in targeted sequencing data as follows: raw reads were first filtered to retain high-quality reads, and erroneous bases were corrected; contigs (unitigs) were then assembled and aligned to the reference genome; finally, breakpoint information was collected and L-indels were called. The algorithm was applied to reads generated on the Illumina platform from the NA12878 cell line and from FFPE samples collected in our lab. Validation was performed by PCR and Sanger sequencing.
Results: Twenty-two novel exonic L-indels (17 deletions and 5 insertions), with a median size of 1,616 bp (range: 25-6,684 bp), were identified in the NA12878 sequencing data, and all were confirmed by Sanger sequencing. In addition, 6 of 9 reported L-indels were also detected; the remaining 3 await further investigation. Strikingly, a 2,446 bp deletion in MSH6, which encodes an important component of the mismatch repair (MMR) system, was detected in an FFPE sample of a lung adenocarcinoma, prompting consideration of MMR deficiency that might otherwise have been overlooked.
Conclusions: We have developed and validated a novel and accurate method for detecting large indels in targeted NGS data in the clinical cancer setting. Adopting this method should greatly increase the ability to comprehensively characterize genomic alterations from a single NGS-based assay and provide more information for potential clinical use.
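
The abstract does not give implementation details for the final step, in which breakpoint information from an aligned contig is turned into an L-indel call. The sketch below is one plausible reading of that step, not the authors' algorithm: it assumes the contig aligns to the reference as two segments and derives the indel type and size from the gaps between them. The names `Segment`, `Indel`, and `call_indel`, the 100 bp threshold default, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a contig (0-based, half-open coordinates)."""
    contig_start: int
    contig_end: int
    ref_start: int
    ref_end: int


@dataclass(frozen=True)
class Indel:
    kind: str     # "deletion" or "insertion"
    ref_pos: int  # reference coordinate of the left breakpoint
    size: int     # net length in bp


def call_indel(left: Segment, right: Segment, min_size: int = 100) -> Optional[Indel]:
    """Infer a large indel from two consecutive segments of a split contig alignment.

    Assumes both segments map to the same strand of the same reference sequence
    and that `left` precedes `right` on both the contig and the reference.
    """
    contig_gap = right.contig_start - left.contig_end  # unaligned contig bases between segments
    ref_gap = right.ref_start - left.ref_end           # reference bases skipped between segments
    if contig_gap < 0 or ref_gap < 0:
        # Overlapping segments need breakpoint trimming, which this sketch omits.
        return None
    net = ref_gap - contig_gap
    if net >= min_size:
        return Indel("deletion", left.ref_end, net)
    if -net >= min_size:
        return Indel("insertion", left.ref_end, -net)
    return None


# Toy coordinates: a contig whose two halves map 2,446 bp apart on the reference.
deletion = call_indel(Segment(0, 150, 1000, 1150), Segment(150, 300, 3596, 3746))
# Toy coordinates: 200 contig bases with no counterpart on the reference.
insertion = call_indel(Segment(0, 100, 500, 600), Segment(300, 400, 600, 700))
print(deletion)
print(insertion)
```

A production caller would also need to handle overlapping or reverse-strand segments, microhomology at the breakpoints, and support from multiple contigs or reads, none of which is attempted here.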

A Novel Approach to Detect Large Indels from Targeted Sequencing Data in Clinical Cancer Setting