Abstract: With the development of sequencing techniques, a growing amount of data has been generated, especially sequencing data. "Big data era" is a term coined in the internet and computer science communities. However, it is also popular in bioinformatics research because of large funded projects such as the Human Genome Project, the Human Microbiome Project, and the 1000 Genomes Project. Here we discuss big data problems in bioinformatics and share some of our comments and perspectives.

First, consider next-generation sequencing data. The latest sequencers, such as Illumina HiSeq and Roche 454, generate massive numbers of short reads. These reads must be mapped to reference genomes or assembled with ab initio techniques. Although related software tools have been developed for decades, single-threaded or stand-alone programs can no longer satisfy users. Parallel platforms such as Hadoop [1, 2], Spark [3, 4], CUDA [5], and MIC [6] are being employed for the massive data. However, most of these parallel works focus on alignment problems, including multiple sequence alignment and sequencing read mapping. The ab initio assembly problem is neglected because big graphs are difficult to handle in parallel: they require huge memory, and they are hard to divide and conquer. We therefore suggest that traditional progressive assembly combined with a parallel mechanism would be an interesting direction, one in which the big graph problem could be resolved.

Big data are difficult to handle on computers with limited memory and low-end configurations. Bioinformatics researchers should therefore develop faster algorithms and more parallel programs. However, big data can sometimes be good news for programmers. One of its benefits is deep learning. As the volume of genomic sequences has grown, researchers can employ deep learning and convolutional neural networks (CNNs) to solve the DNA/protein sequence feature extraction problem [7]. Deep learning techniques are used in protein subcellular localization [8], DNA/RNA/protein modification prediction [9, 10], drug and target prediction [11], etc. A number of surveys [12] have been conducted on deep learning in bioinformatics.

Besides deep learning and parallel computation, traditional bioinformatics research also yields interesting results, as shown in our previous special issue, including genome reconstruction [13], protein submitochondrial location [14], and functional prioritization of noncoding variants [15]. Traditional techniques, such as the support vector machine, have not been weeded out by big data. It is not easy to assess how much data qualifies as big data. It is therefore likely that some works claiming to be big data research have employed deep learning techniques arbitrarily and suffer from overfitting. In the future, researchers should pay more attention to the overfitting problem in the big data era. Moreover, visualization is also an essential topic for big data research, including sequence alignment, network illustration, sample distribution, and molecular space simulation. The future of big data seems promising for bioinformatics research.
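
The point made above, that read-level tasks parallelize far more easily than graph-based assembly, can be illustrated with a small sketch. It is not taken from any of the cited tools: the function names (`count_kmers`, `split_into_chunks`, `parallel_kmer_counts`) and the toy reads are hypothetical, and k-mer counting merely stands in for a per-read task such as read mapping. Each chunk of reads is processed independently and the partial results are merged afterwards, which is the divide-and-conquer structure that a single large assembly graph does not readily admit.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Map step: count every k-mer in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_into_chunks(reads, n_chunks):
    """Divide step: deal the reads into at most n_chunks non-empty lists."""
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def parallel_kmer_counts(reads, k, n_workers=4):
    """Count k-mers chunk by chunk, then merge the partial counts."""
    chunks = split_into_chunks(reads, n_workers)
    total = Counter()
    # Threads keep the sketch portable; a process pool or a Hadoop/Spark
    # job would play the same role for CPU-bound work at scale.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda chunk: count_kmers(chunk, k), chunks):
            total.update(partial)  # Reduce step: add the partial counts.
    return total


# Toy reads, invented purely for illustration.
reads = ["ACGTAC", "GTACGT", "TTACGA"]
counts = parallel_kmer_counts(reads, k=3, n_workers=2)
```

Because no chunk needs to see another chunk's reads, the merge is a simple sum of counters. An assembly graph offers no such clean partition, since an edge may connect reads that landed in different chunks.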

Perspectives of Bioinformatics in Big Data Era.
