Perspectives of Bioinformatics in Big Data Era.
Maozu Guo, Quan Zou
DOI: https://doi.org/10.2174/138920292002190422120915
2019-01-01
Current Genomics
Abstract: With the development of sequencing techniques, a growing amount of data has been generated, especially sequencing data. "Big data era" is a term coined in the internet and computer science communities. However, it has also become popular in bioinformatics research because of large funding programs such as the Human Genome Program, the Human Microbiome Project, 1000 Genomes, etc. We would like to discuss the big data problems in bioinformatics and share some of our comments and perspectives.

First of all, we should discuss next-generation sequencing data. Massive numbers of short reads are generated by the latest sequencers, including Illumina HiSeq, Roche 454, etc. These reads must be mapped to reference genomes or assembled with ab initio techniques. Although related software tools have been developed for decades, single-threaded or stand-alone programs can no longer satisfy users. Parallel platforms are being employed for the massive data, such as Hadoop [1, 2], Spark [3, 4], CUDA [5], MIC [6], etc. However, most of these parallel works focus on alignment problems, including multiple sequence alignment and sequencing read mapping. The ab initio assembly problem is neglected because big graphs are difficult to handle in parallel: they always require huge amounts of memory, and they are not easy to divide and conquer. Therefore, traditional progressive assembly combined with a parallel mechanism would be an interesting direction, in which the big graph problem could be resolved. Big data are difficult to handle on computers with limited memory and low-end configurations, so bioinformatics researchers should develop faster algorithms and more parallel programs. At the same time, this demand is good news for programmers.

One of the benefits of big data is that it makes deep learning practical. As the volume of genomic sequences has grown, researchers have been able to employ deep learning and Convolutional Neural Networks (CNNs) to solve the DNA/protein sequence feature extraction problem [7]. Deep learning techniques are used in protein subcellular localization [8], DNA/RNA/protein modification prediction [9, 10], drug and target prediction [11], etc. A number of surveys [12] have been conducted on deep learning in bioinformatics.

Besides deep learning and parallel computation, traditional bioinformatics research also yields interesting results, as shown previously in our special issue, including genome reconstruction [13], protein submitochondrial locations [14], and functional prioritization of noncoding variants [15]. Traditional techniques, such as the support vector machine, have not been made obsolete by big data. It is not easy to determine how much data qualifies as big data. We therefore suspect that some works that claim to address big data and apply deep learning related techniques arbitrarily suffer from overfitting. In the future, researchers should pay more attention to the overfitting problem in the big data era. Moreover, visualization is also an essential topic for big data research, covering sequence alignment, network illustration, sample distribution, molecular space simulation, etc. The future of big data seems promising for bioinformatics research.
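
As a rough illustration of the CNN-based sequence feature extraction mentioned above, the following minimal Python sketch one-hot encodes a DNA string and slides a single convolution filter over it. It is not taken from any of the cited works: the function names, the example sequence, and the hand-set "TATA" filter are assumptions made purely for illustration, and a real CNN would learn many such filters from data using a deep learning library.

```python
# Illustrative sketch only: one-hot encoding plus a single hand-written
# convolution filter. All names here are hypothetical, not from the cited works.

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows (A, C, G, T)."""
    rows = []
    for base in seq.upper():
        # Unknown bases such as N become an all-zero row.
        rows.append([1.0 if base == letter else 0.0 for letter in ALPHABET])
    return rows


def conv1d(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence with stride 1, no padding."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(ALPHABET)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores


def relu_max_pool(scores):
    """Apply ReLU, then global max pooling, giving one feature per filter."""
    return max((max(0.0, s) for s in scores), default=0.0)


# A fixed kernel that responds to the motif "TATA"; a trained CNN would
# learn such weights from data instead of having them set by hand.
tata_kernel = one_hot("TATA")

sequence = "GGCTATAAGC"  # made-up example sequence
scores = conv1d(one_hot(sequence), tata_kernel)
feature = relu_max_pool(scores)
print(scores, feature)
```

With a one-hot kernel, each score simply counts how many positions in the window match the motif, so the pooled feature reports the best match found anywhere in the sequence. Stacking many learned filters of this kind is what lets a CNN extract sequence features without manual motif design.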