Abstract: Multitiered storage systems, which are built from heterogeneous devices, are widely used in distributed environments to accelerate the I/O performance of upper-layer big data applications. They raise new challenges for efficient data migration through smart caching mechanisms among heterogeneous storage levels, such as MEM‐SSD‐HDD. To optimize the cache policy scheduling mechanism on the distributed tiered storage architecture, we propose a general framework with five layers: a tiered storage system layer, a cache migration policy layer, a cache policy adaptive scheduling layer, a data access pattern layer, and a big data application layer. The framework prototype has been designed and implemented on Alluxio, a widely used distributed hybrid storage system. To meet the demands of the big data application layer, we first designed several cache eviction and promotion policies covering various access patterns on the cache migration policy layer (several of the proposed eviction policies have been adopted by the Alluxio open‐source community). Second, we designed and implemented two adaptive cache policy scheduling algorithms on the cache policy adaptive scheduling layer, which select appropriate policies for various scenarios. The two scheduling algorithms are based on hit ratio statistics and on data access pattern model prediction, respectively. Experimental results show that the proposed cache policies are effective for various big data applications, such as Spark SQL. The proposed cache policy scheduling algorithms with various eviction policies improve the hit ratio by around 20% compared with a single eviction policy.
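The idea of scheduling eviction policies from hit ratio statistics can be illustrated with a minimal sketch. It is not the algorithm from the paper and does not use any Alluxio API: the class names `LRUCache`, `LFUCache`, and `HitRatioScheduler`, the fixed-size observation window, and the use of one simulated "shadow" cache per candidate policy are all assumptions made for illustration. Each shadow cache replays the same accesses, and at the end of every window the policy with the most hits becomes the active one.

```python
from collections import Counter, OrderedDict


class LRUCache:
    """Shadow cache that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block_id):
        """Return True on a hit; on a miss, cache the block and return False."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # drop the oldest entry
        self.blocks[block_id] = True
        return False


class LFUCache:
    """Shadow cache that evicts the least frequently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = Counter()  # access counts of cached blocks only

    def access(self, block_id):
        """Return True on a hit; on a miss, cache the block and return False."""
        if block_id in self.freq:
            self.freq[block_id] += 1
            return True
        if len(self.freq) >= self.capacity:
            victim = min(self.freq, key=self.freq.get)
            del self.freq[victim]
        self.freq[block_id] = 1
        return False


class HitRatioScheduler:
    """Hypothetical scheduler: picks the policy with the best windowed hit ratio."""

    def __init__(self, policies, window):
        self.policies = policies          # policy name -> shadow cache
        self.window = window              # accesses per observation window
        self.hits = {name: 0 for name in policies}
        self.seen = 0
        self.active = next(iter(policies))
        self.last_ratios = {}

    def access(self, block_id):
        """Replay one access on every shadow cache; return the active policy name."""
        for name, cache in self.policies.items():
            if cache.access(block_id):
                self.hits[name] += 1
        self.seen += 1
        if self.seen == self.window:
            self.last_ratios = {n: h / self.window for n, h in self.hits.items()}
            self.active = max(self.hits, key=self.hits.get)
            self.hits = {name: 0 for name in self.policies}
            self.seen = 0
        return self.active


# Synthetic trace: one hot block "a" interleaved with blocks that are read once.
trace = ["a", "a", "a"]
for i in range(10):
    trace += [f"x{i}", f"y{i}", "a"]

scheduler = HitRatioScheduler(
    {"lru": LRUCache(capacity=2), "lfu": LFUCache(capacity=2)},
    window=len(trace),
)
for block in trace:
    scheduler.access(block)

# LFU keeps the hot block cached across the one-off reads, so it wins this window.
print(scheduler.active, scheduler.last_ratios)
```

On this trace the one-off reads push the hot block out of the LRU shadow cache before each reuse, whereas LFU retains it, so the scheduler switches from its initial policy to LFU. A production scheduler would also need to bound the cost of maintaining the shadow caches, which this sketch ignores.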

TS-Hadoop: Handling Access Skew in MapReduce by Using Tiered Storage Infrastructure

Join Query Optimization Based on MapReduce under Skewed Data

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

The performance of MapReduce: an in-depth study

The Performance of MapReduce

Super Rack: Reusing the Results of Queries in MapReduce Systems

A Hierarchical Approach to Maximizing MapReduce Efficiency

Overview of Caching Mechanisms to Improve Hadoop Performance

Reusing the Results of Queries in MapReduce Systems by Adopting Shared Storage

Adaptive Cache Policy Scheduling for Big Data Applications on Distributed Tiered Storage System

Accelerating Big Data Applications on Tiered Storage System with Various Eviction Policies

Column-Oriented Storage Techniques for MapReduce

Application and Storage-Aware Data Placement and Job Scheduling for Hadoop Clusters

Location-Aware Data Block Allocation Strategy for HDFS-Based Applications in the Cloud

HM: A Column-Oriented MapReduce System on Hybrid Storage

A Real-Time Scheduling Strategy Based on Processing Framework of Hadoop

Hybrid storage architecture and efficient MapReduce processing for unstructured data

Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution

Automating distributed tiered storage management in cluster computing

Accelerating MapReduce on Commodity Clusters: an SSD-Empowered Approach