Big Data Technology Accelerate Genomics Precision Medicine

Hao Li
DOI: https://doi.org/10.48550/arXiv.1701.09045
2017-01-29
Abstract: In genomics and life science research, the volume of whole-genome data and the demands of life science algorithms keep growing, with data now measured in TB, PB, or even EB. The key problem is how to store and analyze this data in an optimized way. This paper demonstrates how Intel Big Data Technology and Architecture facilitate and accelerate genomics life science research in data storage and utilization. Intel defines the high-performance GenomicsDB for variant call data queries and the Lustre filesystem with Hierarchical Storage Management for genomics data storage. Based on these technologies, Intel defines a genomics knowledge sharing and exchange architecture, which has been deployed and validated at BGI China and Shanghai Children's Hospital with very positive feedback. These big data technologies can be scaled to many more genomics life science partners worldwide.
Databases
What problem does this paper attempt to address?
The paper addresses the following problem: in genomics and life-science research, as data volumes keep growing (reaching the terabyte (TB), petabyte (PB), or even exabyte (EB) level), how can this data be stored, analyzed, shared, and used efficiently? Specifically:

1. **Data storage problems**:
   - The amount of data generated by genome sequencing is huge. For example, the sequencing data of each patient exceeds 1 TB. In 2015, 1.65 million new patients in the United States generated more than 4 EB of data.
   - China National GeneBank (CNGB) currently holds 500 PB of data and grows by 5-10 PB every year.
   - Shanghai Children's Hospital and the Shanghai Jiao Tong University Supercomputing Center have deployed hundreds of nodes with a total storage capacity of 30 PB.
2. **Data analysis problems**:
   - How to optimize the storage and analysis of this large-scale data to support efficient genomics research.
3. **Data sharing and exchange problems**:
   - How to share and exchange genomic knowledge among different institutions while preserving privacy.

To address these problems, the paper proposes Intel's big-data technology and architecture solutions, including:

- **GenomicsDB**: A database engine for fast queries over variant call data, built on optimized sparse array storage.
- **Lustre file system**: Combined with hierarchical storage management (HSM), it is used to manage and scale genomic data storage efficiently.
- **Genomic knowledge-sharing architecture**: Through a central function module, it enables secure sharing of statistics and summaries of genomic data.

These technologies have been validated at BGI China and Shanghai Children's Hospital and have received positive feedback.
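To make the sparse array idea behind variant call storage more concrete, here is a minimal Python sketch. It is not GenomicsDB's API or its on-disk format. The class `SparseVariantStore`, its methods, and the sample data are all hypothetical, and the layout (one sorted position list per sample) is an assumption chosen for brevity. It only illustrates the general principle: variant calls occupy a tiny fraction of the sample-by-position grid, so only non-empty cells are stored, and a genomic range query touches only those cells.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class SparseVariantStore:
    """Toy sparse 2D array: rows are samples, columns are genomic positions.

    Hypothetical illustration only; this does not mirror GenomicsDB internals.
    """

    def __init__(self):
        # For each sample, keep a sorted list of positions and a parallel
        # list of genotype strings. Empty cells are never materialized.
        self._positions = defaultdict(list)
        self._calls = defaultdict(list)

    def add_call(self, sample, position, genotype):
        """Insert or overwrite the call for (sample, position)."""
        positions = self._positions[sample]
        calls = self._calls[sample]
        idx = bisect_left(positions, position)
        if idx < len(positions) and positions[idx] == position:
            calls[idx] = genotype  # cell already exists, overwrite it
        else:
            positions.insert(idx, position)
            calls.insert(idx, genotype)

    def query(self, start, end, samples=None):
        """Return {sample: [(position, genotype), ...]} for start <= position <= end."""
        wanted = samples if samples is not None else sorted(self._positions)
        result = {}
        for sample in wanted:
            positions = self._positions.get(sample, [])
            lo = bisect_left(positions, start)
            hi = bisect_right(positions, end)
            if lo < hi:
                cells = zip(positions[lo:hi], self._calls[sample][lo:hi])
                result[sample] = list(cells)
        return result

    def stored_cells(self):
        """Number of non-empty cells actually held in memory."""
        return sum(len(p) for p in self._positions.values())


if __name__ == "__main__":
    store = SparseVariantStore()
    store.add_call("S1", 100, "0/1")
    store.add_call("S1", 5000, "1/1")
    store.add_call("S2", 120, "0/1")
    print(store.query(90, 130))
    print(store.stored_cells())
```

In this sketch a range query costs two binary searches per sample, regardless of how wide the genomic coordinate space is. A production engine would additionally need tiling, compression, and persistent storage, none of which is modeled here.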