A Parallel K‐means Clustering Algorithm Based on Redundance Elimination and Extreme Points Optimization Employing MapReduce

Zhuo Tang, Kunkun Liu, Jinbo Xiao, Li Yang, Zheng Xiao
DOI: https://doi.org/10.1002/cpe.4109
2017-01-01
Abstract: Parallel file systems commonly distribute a file across multiple file servers with a fixed-size stripe, thereby allowing data access through multiple file servers. This default data layout works well in traditional homogeneous storage systems, but when solid state disks (SSDs) are introduced into a storage system, a different data layout can give hybrid parallel file systems better I/O performance. In this study, we propose a variable-sized stripe level data layout strategy for hybrid parallel file systems (SLDP). SLDP divides the file into several regions according to the data access pattern and then finds the optimal configuration for each region among the SSD file server nodes and the mechanical hard disk drive (HDD) file server nodes. It uses variable stripe sizes to reorganize the data layout of file systems. Furthermore, it considers the SSD space limitation: the main idea is to distribute key regions of the file across the hybrid parallel file system based on the optimal stripe configuration, which can significantly improve system I/O throughput. The remaining parts of a file are then distributed according to the SSD free space threshold, which leverages the SSD servers as much as possible. To achieve this, SLDP divides a large file into many fine-grained regions and adjusts the data layout method for each region according to the access pattern. Experimental results show that SLDP is feasible and can improve system performance. Copyright (C) 2016 John Wiley & Sons, Ltd.
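To make the idea of region-based, variable-sized striping concrete, the following is a minimal sketch of how a byte offset could be mapped to a file server when each region carries its own stripe sizes. It is an illustration only, not the authors' implementation: the `Region` fields, the `locate` function, the HDD-before-SSD round-robin order, and all numbers are assumptions. How SLDP actually chooses the stripe sizes and applies the SSD free space threshold is not described in the abstract and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    # All fields are hypothetical names chosen for this sketch.
    start: int       # first byte offset of the region in the file
    end: int         # one past the last byte offset of the region
    hdd_stripe: int  # bytes placed on each HDD server per round (0 = skip HDDs)
    ssd_stripe: int  # bytes placed on each SSD server per round (0 = skip SSDs)


def locate(offset, regions, n_hdd, n_ssd):
    """Return (server_kind, server_index) for the byte at `offset`.

    Assumes that within a region data is laid out round-robin:
    first one stripe on every HDD server, then one stripe on every SSD server.
    """
    for r in regions:
        if r.start <= offset < r.end:
            hdd_span = n_hdd * r.hdd_stripe
            round_size = hdd_span + n_ssd * r.ssd_stripe
            if round_size == 0:
                raise ValueError("region has no usable stripe")
            # Position of the byte inside the current round of stripes.
            pos = (offset - r.start) % round_size
            if pos < hdd_span:
                return ("hdd", pos // r.hdd_stripe)
            return ("ssd", (pos - hdd_span) // r.ssd_stripe)
    raise ValueError("offset outside all regions")


# Two regions with different stripe configurations (made-up values).
regions = [
    Region(start=0, end=1000, hdd_stripe=100, ssd_stripe=300),
    Region(start=1000, end=2000, hdd_stripe=0, ssd_stripe=250),
]
```

With two HDD servers and two SSD servers, `locate(150, regions, 2, 2)` falls in the second HDD stripe of the first region, while every offset in the second region lands on an SSD server because its HDD stripe size is zero. A fixed-size layout is the special case where every region uses the same stripe size on all servers.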