The Optimization of HDFS Based on Small Files

Liu Jiang, Bing Li, Meina Song
DOI: https://doi.org/10.1109/icbnmt.2010.5705223
2010-01-01
Abstract: HDFS is a distributed file system that can process large amounts of data effectively across large clusters, and the Hadoop framework built on it has been widely used to build large-scale, high-performance systems. However, HDFS is designed to handle large files and suffers a performance penalty when dealing with a large number of small files. Many companies now focus on cloud storage, such as Amazon's S3, which provides data hosting. With the rapid development of the Internet, users may increasingly store their data and programs on cloud computing platforms, and personal data has an obvious characteristic: most of it consists of small files, so HDFS cannot meet this demand. In this article, we optimize HDFS I/O for small files; the basic idea is to let one block store many small files and to let the datanode keep some metadata of the small files in its memory. Experiments show that our design provides better performance.
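The basic idea in the abstract (many small files packed into one block, with per-file metadata held in memory) can be illustrated with a minimal Python sketch. This is not the authors' implementation and uses no Hadoop API: the class `PackedBlock`, its methods, and the 64 MB capacity are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class PackedBlock:
    """Hypothetical stand-in for one block that holds many small files."""

    capacity: int = 64 * 1024 * 1024  # assumed block size in bytes
    data: bytearray = field(default_factory=bytearray)
    # In-memory metadata: file name -> (offset, length) inside the block.
    index: dict = field(default_factory=dict)

    def put(self, name: str, payload: bytes) -> bool:
        """Append a small file; return False if it is a duplicate or does not fit."""
        if name in self.index or len(self.data) + len(payload) > self.capacity:
            return False
        self.index[name] = (len(self.data), len(payload))
        self.data += payload
        return True

    def get(self, name: str) -> bytes:
        """Read a small file back using only the in-memory index."""
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])
```

In this sketch a read needs only one dictionary lookup plus a slice of the block, which is the kind of saving the abstract attributes to keeping small-file metadata in the datanode's memory.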