Abstract: Many massive data processing applications need long, continuous, and uninterrupted data access. Distributed file systems serve as the back-end storage, providing global namespace management and reliability guarantees. As system scale grows, hardware failures and software issues become more frequent, and metadata service reliability has become critical because it directly affects file and directory operations. Existing metadata management mechanisms provide some fault tolerance but are inadequate: they often have limitations in system availability, state consistency, and performance overhead, and they lack an effective mechanism for metadata reliability. This paper introduces a novel highly reliable metadata service to address these issues in large-scale file systems. Unlike traditional strategies, the proposed service adopts a new active-standby architecture for fault tolerance and takes a holistic approach to improving file system availability. A new shared storage pool (SSP) is designed for transparent metadata synchronization and replication between active and standby servers. Based on the SSP, a new policy called multiple actives multiple standbys (MAMS) is presented to perform metadata service recovery in case of failures. A new global state recovery strategy and a smart client fault tolerance mechanism are developed to maintain the continuity of the metadata service. We have implemented this highly reliable metadata service in a prototype file system, CFS (Clover file system), and conducted extensive tests to evaluate it. Experimental results confirm that it can significantly improve file system reliability with fast failover under different failure scenarios while having negligible influence on performance. Compared with typical reliability designs in the Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the highly reliable metadata service was reduced by 80.23, 65.46, and 28.13 percent, respectively.

An Adaptive Metadata Management Scheme Based on Deep Reinforcement Learning for Large-Scale Distributed File Systems

DEAM: Decoupled, Expressive, Area-Efficient Metadata Cache

A Scalable, Adaptive, Self-management and Fault-Tolerant Architecture for Digital Library

Efficient Dynamic Management of Distributed Metadata

Research and Design of MDS in Distributed Storage System

A Highly Reliable Metadata Service for Large-Scale Distributed File Systems

Dynamic hashing: Adaptive metadata management for petabyte-scale file systems

CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster

Efficient Hierarchical Storage Management Framework Empowered by Reinforcement Learning

Design and Implementation of Metadata Management System

Distributed Metadata Management Based on Hierarchical Bloom Filters in Data Grid

ICCG: low-cost and efficient consistency with adaptive synchronization for metadata replication

DMADRL: A Distributed Multi-agent Deep Reinforcement Learning Algorithm for Cognitive Offloading in Dynamic MEC Networks

MECC: A Mobile Edge Collaborative Caching Framework Empowered by Deep Reinforcement Learning

Data Management Across Geographically Distributed Autonomous Systems: Architecture, Implementation, and Performance Evaluation

λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions

Efficient Search Using Adaptive Metadata Spreading In Peer-To-Peer Networks

Metadata Management Mechanism Based on Route Directory

Adaptive Cache Policy Scheduling for Big Data Applications on Distributed Tiered Storage System

Distributed Resource Scheduling for Large-Scale MEC Systems: A Multiagent Ensemble Deep Reinforcement Learning With Imitation Acceleration

An Adaptive Control Mechanism for Access Control in Large-Scale Distributed Systems