Abstract: In the big data era, distributed file systems are increasingly important because of their scale-out capability, high availability, and high performance. Different distributed file systems have different design goals. For example, some, such as GlusterFS, are designed to perform well on small-file operations, while others, such as the Hadoop Distributed File System (HDFS), are designed for large-file operations. As big data applications diverge, a distributed file system may perform well for some applications but poorly for others; that is, no universal distributed file system delivers good performance for all applications. In this paper, we propose a hybrid file system framework, HybridFS, which can deliver satisfactory performance for all applications. HybridFS is composed of multiple distributed file systems and integrates their advantages. On top of these file systems, we have designed a metadata management server that performs three functions: file placement, partial metadata store, and dynamic file migration. File placement is based on a decision tree. The partial metadata store handles files smaller than a few hundred bytes to increase throughput. Dynamic file migration balances storage usage across the distributed file systems without throttling performance. We have implemented HybridFS in Java on eight nodes and chose Ceph, HDFS, and GlusterFS as the underlying distributed file systems. The experimental results show that, in the best case, HybridFS improves read/write performance by up to 30% over a single distributed file system. In addition, if the difference in storage usage among the distributed file systems is less than 40%, HybridFS incurs no performance degradation.
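
To make the file placement idea concrete, the following is a minimal sketch of a size-based decision tree. It is not the paper's implementation. The class name `PlacementSketch`, the `Target` enum, and both thresholds are assumptions chosen for illustration: the abstract says only "a few hundred bytes" for the partial metadata store and gives no boundary between small and large files. Ceph is left out because the abstract does not say when it is preferred.

```java
// Hypothetical sketch: names and thresholds are assumptions, not taken from the paper.
public class PlacementSketch {
    // Possible destinations for a new file. Ceph is omitted because the abstract
    // does not state which files it should receive.
    enum Target { METADATA_STORE, GLUSTERFS, HDFS }

    // Assumed cutoff for "a few hundred bytes" (partial metadata store).
    static final long TINY_FILE_BYTES = 512;
    // Assumed boundary between small-file and large-file workloads.
    static final long LARGE_FILE_BYTES = 64L * 1024 * 1024;

    // A two-level decision tree that uses file size as its only feature.
    static Target place(long sizeBytes) {
        if (sizeBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (sizeBytes < TINY_FILE_BYTES) {
            return Target.METADATA_STORE; // kept alongside the metadata
        }
        if (sizeBytes < LARGE_FILE_BYTES) {
            return Target.GLUSTERFS;      // described as suited to small files
        }
        return Target.HDFS;               // described as suited to large files
    }

    public static void main(String[] args) {
        long[] sizes = {100, 4096, 256L * 1024 * 1024};
        for (long size : sizes) {
            System.out.println(size + " bytes -> " + place(size));
        }
    }
}
```

A real decision tree would likely branch on more than file size, for example on access pattern or current storage usage; size alone is used here only because it is the one feature the abstract names.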

HybridFS - A High Performance and Balanced File System Framework with Multiple Distributed File Systems
