Abstract: The relational DBMS (RDBMS) has been widely used because it supports various high-level functionalities, such as SQL, schemas, indexes, and transactions, that do not exist in the O/S file system. However, the recent advent of big data technology has facilitated the development of new systems that sacrifice DBMS functionality in order to manage large-scale data efficiently. These so-called NoSQL systems use a distributed file system (DFS), which supports scalability and reliability. They achieve scalability by storing data across a large number of low-cost commodity machines and reliability by storing the data in replicas. However, they have the drawback that they do not adequately support high-level DBMS functionality. In this paper, we propose an architecture of a DBMS that uses the DFS as storage. With this novel architecture, the DBMS is capable of supporting the scalability and reliability of the DFS as well as the high-level functionality of a DBMS. Thus, a DBMS can utilize the virtually unlimited storage space provided by the DFS, making it suitable for big data analytics. As part of the architecture, we propose the notion of the meta DFS file, which allows the DBMS to use the DFS as storage, and an efficient transaction management method including recovery and concurrency control. We implement this architecture in Odysseus/DFS, an integration of the DFS with the Odysseus relational DBMS, which has been under development at KAIST for over 24 years. Our experiments on transaction processing show that, due to its high-level functionality, Odysseus/DFS outperforms HBase, a representative open-source NoSQL system. We also show that, compared with an RDBMS with local storage, the performance of Odysseus/DFS is comparable or only marginally degraded, indicating that the overhead Odysseus/DFS incurs for supporting scalability by using the DFS as storage is not significant.

H-DB: Yet Another Big Data Hybrid System of Hadoop and DBMS

A hybrid system of Hadoop and DBMS for earthquake precursor application

Design and Implementation of Clinical Data Integration and Management System Based on Hadoop Platform

HBaseSpatial: A Scalable Spatial Data Storage Based on HBase

Implement Data Sharing over Network Heterogeneous Databases by Data Dimension Reduction Method

HMIBase: an Hierarchical Indexing System for Storing and Querying Big Data

hStorage-DB: Heterogeneity-aware Data Management to Exploit the Full Capability of Hybrid Storage Systems

A Massive Small File Storage Solution Combination of RDBMS and Hadoop

Hetero-DB: Next Generation High-Performance Database Systems by Best Utilizing Heterogeneous Computing and Storage Resources

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

The performance of MapReduce: an in-depth study

Design and Application of Bank Big Data Platform Based on Hadoop Technology

Design and Construction of a Big Data Analytics Framework for Health Applications

The Performance of MapReduce

HM: A Column-Oriented MapReduce System on Hybrid Storage

The Storage And Analytics Potential Of HBase Over The Cloud: A Survey

Odysseus/DFS: Integration of DBMS and Distributed File System for Transaction Processing of Big Data

DynaHash: Efficient Data Rebalancing in Apache AsterixDB (Extended Version)

AQUA+: Query Optimization for Hybrid Database-MapReduce System

Towards a Real-Time Big Data Analytics Platform for Health Applications

Power Big Data Analysis Platform Design Based on Hadoop