Abstract:

Background: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, querying relational databases for hundreds of different patient gene expression records is slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise as more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data.

Results: In this paper we introduce a new data model better suited to high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set on Multiple Myeloma taken from NCBI GEO. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance compared to MongoDB.

Conclusions: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that used in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
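To make the idea of a key-value layout for expression data more concrete, the following is a minimal Python sketch. It is not the paper's schema: the abstract does not state how row keys are composed, so the composite key `patient_id|probe_id`, the `SortedKeyValueStore` class, and the sample values are all hypothetical. The sketch only illustrates why a store that keeps keys sorted, as BigTable-style systems do, can answer "all expression values for one patient" with a single contiguous scan rather than a join.

```python
from bisect import bisect_left


class SortedKeyValueStore:
    """Toy in-memory stand-in for a sorted key-value table (not HBase itself)."""

    def __init__(self):
        self._keys = []    # row keys kept in sorted order
        self._values = {}  # row key -> stored value

    def put(self, key, value):
        # Insert new keys at their sorted position; overwrite existing ones.
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def prefix_scan(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        idx = bisect_left(self._keys, prefix)
        while idx < len(self._keys) and self._keys[idx].startswith(prefix):
            key = self._keys[idx]
            yield key, self._values[key]
            idx += 1


def row_key(patient_id, probe_id):
    # Hypothetical composite key: patient first, so that all probes
    # belonging to one patient are adjacent in the sorted key space.
    return f"{patient_id}|{probe_id}"


store = SortedKeyValueStore()
store.put(row_key("P001", "probe_A"), 7.5)   # made-up expression values
store.put(row_key("P002", "probe_A"), 6.25)
store.put(row_key("P001", "probe_B"), 3.0)

# Retrieve one patient's whole expression profile with a single range scan.
patient_profile = dict(store.prefix_scan("P001|"))
```

Placing the patient identifier first is one possible design choice; putting the probe first would instead favour queries for one gene across many patients. Which ordering the authors chose is not stated in the abstract.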

Implementing Suffix Array Algorithm Using Apache Big Table Data Implementation

A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform

Analyzing large-scale DNA Sequences on Multi-core Architectures

Massive Genomic Data Processing and Deep Analysis

Scalable and Efficient Construction of Suffix Array with MapReduce and In-Memory Data Store System

DGST: Efficient and Scalable Suffix Tree Construction on Distributed Data-Parallel Platforms

Big Data Technology Accelerate Genomics Precision Medicine

DNA-SaM, a robust system for large-scale data storage

ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery

Covering All Bases: The Next Inning in DNA Sequencing Efficiency

Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics

DNACloud: A Potential Tool for storing Big Data on DNA

Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

The Power of Word-Frequency Based Alignment-Free Functions: a Comprehensive Large-scale Experimental Analysis -- Version 3

An advanced approach for DNA sequencing and similarities analysis on the basis of groupings of nucleotide bases

Sparksw: Scalable Distributed Computing System For Large-Scale Biological Sequence Alignment

DNAscan: a fast, computationally and memory efficient bioinformatics pipeline for the analysis of DNA next-generation-sequencing data

Prefix-free graphs and suffix array construction in sublinear space

Content-based filter queries on DNA data storage systems

High dimensional biological data retrieval optimization with NoSQL technology