Abstract:

Background: New high-throughput pyrosequencers such as the 454 Life Sciences GS 20 massively parallelize DNA sequencing, providing an unprecedented rate of output data and potentially reducing costs. However, these pyrosequencers have a different error profile and produce shorter reads than a traditional Sanger sequencer. This poses new challenges for how the data are handled and analyzed. In addition, the steep increase in the sequencer's throughput calls for substantial computing power at low cost.

Results: To address these challenges, we created an automated multi-step computation pipeline integrated with a database storage system. This allowed us to store, handle, index and search (1) the output data from the GS 20 sequencer, (2) analysis projects, possibly several per dataset, (3) final results of analysis computations, and (4) intermediate results of computations (these allow manual comparisons and hence further searches by the biologists). Repeatability of computations was also a requirement. To access the needed computing power, we ported the pipeline to the European Grid: a large community of clusters, load balanced as a whole. To support this port we created Vnas: an innovative framework for Grid job submission, virtual sandbox management and job callbacks. After several runs of the pipeline aimed at tuning the parameters and thresholds for optimal results, we successfully analyzed 273 sequenced amplicons from a cancerous human sample and correctly found point mutations confirmed by either Sanger resequencing or NCBI dbSNP. The sequencing was performed with our 454 Life Sciences GS 20 pyrosequencer.

Conclusion: We handled the steep increase in throughput from the new pyrosequencer by building an automated computation pipeline coupled with database storage, and by leveraging the computing power of the European Grid. The Grid platform is a very cost-effective choice for uneven workloads, which are typical of many scientific research fields, provided its peculiarities, which we discuss, can be accepted. The infrastructure described here was used to analyze human amplicons for mutations. More analyses will be performed in the future.

A graphical, interactive and GPU-enabled workflow to process long-read sequencing data

NanoSPC: a scalable, portable, cloud compatible viral nanopore metagenomic data processing pipeline

Readfish enables targeted nanopore sequencing of gigabase-sized genomes

MAGI: a Node.js web service for fast microRNA-Seq analysis in a GPU infrastructure

High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

DNAscan: a fast, computationally and memory efficient bioinformatics pipeline for the analysis of DNA next-generation-sequencing data

Data handling strategies for high throughput pyrosequencers

Metapipeline-DNA: A Comprehensive Germline & Somatic Genomics Nextflow Pipeline

Nanopore adaptive sequencing for mixed samples, whole exome capture and targeted panels

Accelerating K-mer Frequency Counting with GPU and Non-Volatile Memory

mm2-gb: GPU Accelerated Minimap2 for Long Read DNA Mapping

A Robust Parallel Computing Data Extraction Framework for Nanopore Experiments

Efficient real-time selective genome sequencing on resource-constrained devices

Streamlining remote nanopore data access with slow5curl

A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset

Accelerating Minimap2 for Accurate Long Read Alignment on GPUs

A cloud-based workflow to quantify transcript-expression levels in public cancer compendia

A Fully Integrated End-to-End Genome Analysis Accelerator for Next-Generation Sequencing

Harmonizing and integrating the NCI Genomic Data Commons through accessible, interactive, and cloud-enabled workflows

Accelerating massive short reads mapping for next generation sequencing (abstract only).