Abstract: With the development of next-generation sequencing (NGS), DNA/RNA sequencing has become cheaper and more efficient. Today, a whole human genome can be sequenced for under $1,000, which opens opportunities for large-scale bioinformatic analysis on big datasets. However, most existing bioinformatic analysis tools are written for single-server computing platforms and are not suited to processing such large datasets. As Hadoop MapReduce and Spark gain popularity as cluster-based big data processing platforms, more bioinformatic applications are exploring cluster computing for large-scale data analysis. In this paper we present an in-depth experimental study on deploying Spark clusters for high-performance bioinformatic short sequence reconstruction. Our experimental results enable us to answer a number of challenging and frequently asked questions about the efficient management of bioinformatic data analysis services on Spark systems. How should a big dataset be split into multiple partitions, and how should the data partitions and bioinformatic analysis tasks be distributed across a Spark cluster to carry out a high-performance distributed analysis job? What types of memory models are effective for bioinformatic data analysis services on a Spark cluster? Why do different bioinformatic data analysis operations exhibit different throughput on the same Spark cluster? We conjecture that this experimental study not only demonstrates the feasibility of high-performance bioinformatic data analysis on the Spark platform, but will also help bioinformatic application developers make more informed decisions about the design and configuration of Spark clusters and about managing and tuning the parameters of the Spark runtime system to improve the performance of large-scale data analytics.

Metaspark: A Spark-Based Distributed Processing Tool to Recruit Metagenomic Reads to Reference Genomes

SOAPMetaS: profiling large metagenome datasets efficiently on distributed clusters

Analysis of Plant Breeding on Hadoop and Spark

An Experimental Study of a Biosequence Big Data Analysis Service

Bioinformatics Applications on Apache Spark

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

Accelerating Large-Scale Genomic Analysis With Spark

scSparkXMBD - High-Performance scRNA-seq Data Processing with Spark

Distributed data analysis and processing platform based on Pig_Spark

Parallel Read Partitioning for Concurrent Assembly of Metagenomic Data

Parallel-META: A high-performance computational pipeline for metagenomic data analysis

deSPI: efficient classification of metagenomic reads with lightweight de Bruijn graph-based reference indexing

FR-HIT, a very fast program to recruit metagenomic reads to homologous reference genomes

VariantSpark: population scale clustering of genotype information

Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization

MetaKSSD: Boosting the Scalability of Reference Taxonomic Marker Database and the Performance of Metagenomic Profiling Using Sketch Operations

Sparksw: Scalable Distributed Computing System For Large-Scale Biological Sequence Alignment

Scalable and Parallel Sequential Pattern Mining Using Spark

Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform