Abstract: With the development of next-generation sequencing (NGS), DNA/RNA sequencing has become cheaper and more efficient. Today, a whole human genome can be sequenced for under $1,000, providing opportunities for large-scale bioinformatic analysis on big datasets. However, most existing bioinformatic analysis tools are programmed for single-server computing platforms and are not suitable for processing such big datasets. As Hadoop MapReduce and Spark gain popularity as cluster-computing-based big data processing platforms, more and more bioinformatic applications are starting to explore cluster computing platforms for large-scale data analysis. In this paper, we present an in-depth experimental study on deploying Spark clusters for high-performance bioinformatic short sequence reconstruction. Our experimental results enable us to answer a number of challenging and frequently asked questions regarding the efficient management of bioinformatic data analysis services on Spark systems. Example questions include: How should a big dataset be split into multiple partitions, and how should data partitions and bioinformatic analysis tasks be distributed across a Spark cluster to carry out a high-performance distributed analysis job? What types of memory models are effective for bioinformatic data analysis services on a Spark cluster? Why do different bioinformatic data analysis operations exhibit different throughput performance on the same Spark cluster? We conjecture that this experimental study not only demonstrates the feasibility of high-performance bioinformatic data analysis on the Spark platform, but will also help bioinformatic application developers make more informed decisions on the design and configuration of Spark clusters and on managing and tuning the parameters of the Spark runtime system to enhance the performance of large-scale big data analytics.
