A Hadoop-based Massive Molecular Data Storage Solution for Virtual Screening

Yan Zhang,Ruisheng Zhang,Qiuqiang Chen,Xiaopan Gao,Rongjing Hu,Ying Zhang,Guangcai Liu
DOI: https://doi.org/10.1109/chinagrid.2012.26
2012-01-01
Abstract:Virtual Screening involves massive computing tasks with millions of molecules docking on the targeted protein. Such data-intensive science always faces the challenge of managing tens of TB datasets, which gives rise to the requirement of large-scale storage. Furthermore, the efficient query and transmission of the large-scale datasets are the other key requirements during the virtual screening progress. Therefore, in this data-intensive application, a massive data storage solution is expected to improve the efficiency of storage and access of large-scale molecules and their docking results, as well as facilitating the data preparing and analysis phases of virtual screening. In order to address the key requirements mentioned above, we proposed a novel storage solution based on Hadoop for virtual screening. HBase was implemented as a distributed database to persist the properties of massive molecules and docking results. HDFS was utilized as a molecule source files storage system. The comparison of the system performance was also presented. Finally, we concluded that the storage solution we proposed could be considered as an alternative attempt to enable the efficient storage and access of large-scale molecules and docking results in virtual screening research.
What problem does this paper attempt to address?