Abstract: With the development of sequencing techniques, a growing volume of data has been generated, especially sequencing data. "Big data era" is a term coined in internet and computer science. It has also become popular in bioinformatics research, driven by large, well-funded projects such as the Human Genome Project, the Human Microbiome Project, and the 1000 Genomes Project. Here we discuss big data problems in bioinformatics and share some of our comments and perspectives.

First, consider next-generation sequencing data. The latest sequencers, including Illumina HiSeq and Roche 454, generate massive numbers of short reads. These reads must be either mapped to a reference genome or assembled with ab initio (de novo) techniques. Although related software tools have been developed for decades, single-threaded or stand-alone programs can no longer satisfy users. Parallel platforms are therefore being employed for massive data, such as Hadoop [1, 2], Spark [3, 4], CUDA [5], and MIC [6]. However, most of this parallel work has focused on alignment problems, including multiple sequence alignment and sequencing read mapping. Ab initio assembly has been neglected because big graphs are difficult to handle in parallel: they require a large amount of memory and are not easy to divide and conquer. We therefore suggest that traditional progressive assembly combined with a parallel mechanism would be an interesting direction, one in which the big graph problem could be resolved.

Big data are difficult to handle on computers with limited memory and low-end configurations, so bioinformatics researchers should develop faster algorithms and more parallel programs. However, big data can also be good news for programmers. One of its benefits is deep learning. As the volume of genomic sequence data has grown, researchers have been able to employ deep learning, for example convolutional neural networks (CNNs), to solve the DNA/protein sequence feature extraction problem [7]. Deep learning techniques are used in protein subcellular localization [8], DNA/RNA/protein modification prediction [9, 10], drug and target prediction [11], etc. A number of surveys [12] have been conducted on deep learning in bioinformatics.

Besides deep learning and parallel computation, traditional bioinformatics research also continues to produce interesting results, as shown in our previous special issue, including genome reconstruction [13], protein submitochondrial locations [14], and functional prioritization of noncoding variants [15]. Traditional techniques, such as support vector machines, have not been made obsolete by big data. It is not easy to say how much data qualifies as big data. We therefore suspect that some studies claiming to work with big data have applied deep learning techniques arbitrarily and suffer from overfitting. In the future, researchers should pay more attention to the overfitting problem in the big data era. Moreover, visualization is an essential topic for big data research, including sequence alignment, network illustration, sample distribution, and molecular space simulation. The future of big data looks promising for bioinformatics research.
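The divide-and-conquer pattern mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: k-mer counting is chosen here only as a simple stand-in for read-level tasks in which every read can be processed independently, and the function names (`count_kmers`, `split_into_chunks`, `parallel_kmer_count`) are hypothetical. Threads are used purely to keep the sketch self-contained; in CPython they would give little real speed-up for this CPU-bound work, and a production pipeline would more likely use processes or a platform such as Hadoop or Spark.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Map step: count the k-mers in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_kmer_count(reads, k, n_workers=4):
    """Divide the reads, count each chunk independently, then merge."""
    chunks = split_into_chunks(reads, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk is handled without any knowledge of the others.
        for partial in pool.map(lambda chunk: count_kmers(chunk, k), chunks):
            total.update(partial)  # reduce step: merge partial counts
    return total


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]  # made-up reads
    print(parallel_kmer_count(toy_reads, k=3, n_workers=2))
```

Because the partial counts merge by simple addition, the chunks never need to communicate. Graph-based assembly lacks this property: edges connect k-mers that may sit in different chunks, which is one way to see why the abstract describes big graphs as hard to divide and conquer.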

Big Data Technology Accelerate Genomics Precision Medicine

Massive Genomic Data Processing and Deep Analysis

Computational Strategies for Scalable Genomics Analysis

How Big Data and High-performance Computing Drive Brain Science

The BIG Data Center: from deposition to integration to translation

Artificial Intelligence, Physiological Genomics, and Precision Medicine.

Genomics and Biological Big Data: Facing Current and Future Challenges around Data and Software Sharing and Reproducibility

Perspectives of Bioinformatics in Big Data Era.

Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges

[The BIG Data Center's database resources].

Advancing clinical cohort selection with genomics analysis on a distributed platform

Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis, and beyond

Use of big data in drug development for precision medicine: an update

High Performance Computational Biology and Drug Design on TianHe Supercomputers.

Accelerating Large-Scale Genomic Analysis With Spark

Big Data access and infrastructure for modern biology: case studies in data repository utility

Big Data, Big Challenges

Quantifying and Mitigating Computational Inefficiency of Genomics Data Analysis

Review of Bioinformatics Application Using Intel MIC

DNA-SaM, a robust system for large-scale data storage

Big Data Challenges in Genome Informatics