Abstract: Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that executing these two most time-consuming steps separately inherently leads to (1) significant data movement and (2) redundant computations on the data, both of which slow down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory, fine-grained collaborative execution of the major genome analysis steps in parallel; and (2) a new technique for early rejection of low-quality and unmapped reads, which stops the execution of genome analysis for such reads as soon as possible and thereby reduces wasted computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
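
The two mechanisms named in the abstract, chunk-level interleaving of basecalling with read mapping and early rejection of low-quality or unmapped reads, can be illustrated in software with a minimal sketch. This is not GenPIP's hardware design or its actual rejection criteria. The function `process_read`, the callbacks `basecall_chunk` and `map_chunk`, and the thresholds `min_quality` and `max_unmapped_chunks` are all hypothetical names and values chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical types (assumptions, not from the paper):
# a chunk of raw signal samples, and a basecalled chunk given as
# (bases, mean quality score of the chunk).
Signal = List[float]
Basecalled = Tuple[str, float]


def process_read(
    chunks: Iterable[Signal],
    basecall_chunk: Callable[[Signal], Basecalled],
    map_chunk: Callable[[str], bool],
    min_quality: float = 7.0,       # assumed threshold, illustrative only
    max_unmapped_chunks: int = 3,   # assumed threshold, illustrative only
) -> Optional[str]:
    """Basecall and map a read chunk by chunk.

    Returns the basecalled sequence, or None if the read is rejected
    early, in which case the remaining chunks are never processed.
    """
    bases: List[str] = []
    quality_sum = 0.0
    unmapped = 0
    for n, chunk in enumerate(chunks, start=1):
        seq, quality = basecall_chunk(chunk)
        quality_sum += quality
        # Early rejection 1: running mean quality is too low.
        if quality_sum / n < min_quality:
            return None
        # Mapping runs on each chunk as soon as it is basecalled,
        # instead of waiting for the whole read.
        if not map_chunk(seq):
            unmapped += 1
            # Early rejection 2: too many chunks failed to map.
            if unmapped >= max_unmapped_chunks:
                return None
        bases.append(seq)
    return "".join(bases)
```

In this sketch, a rejected read costs only the chunks processed before the rejection, which is the source of the saved computation; a pipeline that basecalls the whole read before mapping would pay for every chunk regardless.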

Quantifying and Mitigating Computational Inefficiency of Genomics Data Analysis

SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds

A cost-effective approach to improving performance of big genomic data analyses in clouds

Accelerating Large-Scale Genomic Analysis With Spark

Gene Sequence Alignment on a Public Computing Platform

High-performance Genomic Analysis Framework with In-Memory Computing

The parallelism motifs of genomic data analysis

GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

Disaggregating Non-Volatile Memory for Throughput-Oriented Genomics Workloads

Computational Strategies for Scalable Genomics Analysis

Efficient storage and regression computation for population-scale genome sequencing studies

Parallel Accelerated Custom Correlation Coefficient Calculations for Genomics Applications

Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Investigating Memory Optimization of Hash-Index for Next Generation Sequencing on Multi-Core Architecture

SparkGC: Spark based genome compression for large collections of genomes

Accelerating Genome Analysis: A Primer on an Ongoing Journey

Exploiting Parallelism for Bioinformatics Data Analysis Applications by Data Transformation Graph

Accelerating massive short reads mapping for next generation sequencing (abstract only)

An experimental study of optimizing bioinformatics applications

GenSeq+: A Scalable High-Performance Accelerator for Genome Sequencing

Distributed Gene Clinical Decision Support System Based on Cloud Computing