Abstract: With the development of sequencing techniques, a growing amount of data has been generated, especially sequencing data. The "big data era" is a term coined in the internet and computer science communities. However, it is also popular in bioinformatics research because of large, well-funded projects such as the Human Genome Project, the Human Microbiome Project, and the 1000 Genomes Project. Here we discuss big data problems in bioinformatics and share some of our comments and perspectives.

First, consider next-generation sequencing data. Massive numbers of short reads are generated by the latest sequencers, including Illumina HiSeq and Roche 454. These reads must be mapped to reference genomes or assembled with ab initio techniques. Although related software tools have been developed for decades, single-threaded or stand-alone programs can no longer satisfy users. Parallel platforms such as Hadoop [1, 2], Spark [3, 4], CUDA [5], and MIC [6] are being employed for the massive data. However, most of these parallel works focus on alignment problems, including multiple sequence alignment and sequencing read mapping. The ab initio assembly problem is neglected because big graphs are difficult to handle in parallel: they require huge amounts of memory and are not easy to divide and conquer. We therefore suggest that traditional progressive assembly combined with a parallel mechanism would be an interesting direction, in which the big graph problem could be resolved. Big data is difficult to handle on computers with limited memory and low-end configurations, so bioinformatics researchers should develop faster algorithms and more parallel programs.

However, big data is sometimes also good news for programmers. One of its benefits is deep learning. As the number of genomic sequences has grown, researchers can employ deep learning and convolutional neural networks (CNNs) to solve the feature extraction problem for DNA and protein sequences [7]. Deep learning techniques are used in protein subcellular localization [8], DNA/RNA/protein modification prediction [9, 10], drug and target prediction [11], and other tasks. A number of surveys on deep learning in bioinformatics have been conducted [12].

Besides deep learning and parallel computation, traditional bioinformatics research also continues to produce interesting results, as shown in our previous special issue, including genome reconstruction [13], protein submitochondrial locations [14], and functional prioritization of noncoding variants [15]. Traditional techniques, such as support vector machines, have not been made obsolete by big data. It is not easy to determine how much data qualifies as big data. We therefore suspect that some works that claim to address big data apply deep learning techniques arbitrarily and suffer from overfitting. In the future, researchers should pay more attention to the overfitting problem in the big data era. Moreover, visualization is also an essential topic for big data research, including sequence alignment, network illustration, sample distribution, and molecular space simulation. The future of big data looks promising for bioinformatics research.
