Performance Analysis of Hadoop for Handling Small Files in Single Node

YUAN Yu, CUI Chaoyuan, WU Yun, CHEN Zhuhong
DOI: https://doi.org/10.3778/j.issn.1002-8331.1206-0452
2013-01-01
Abstract: Hadoop is a software framework for distributed processing of large data sets, and it works well with large files. Whether it works equally well with small files is less clear. Using word-frequency counting as an example, this paper runs experiments on several typical file sets on a single node and compares Hadoop's performance on small files under different FileInputFormat implementations. The performance differences are then explained in terms of Hadoop's own execution model. The analysis shows that packing many small files into one split can improve Hadoop's performance.
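The effect of packing can be illustrated with a minimal sketch. It is not the paper's experiment and does not use any Hadoop API. It assumes only that each input split is handled by one map task and that an input format which does not combine files yields at least one split per file. The function names, the 4 KiB file size, and the 64 MiB split limit are hypothetical values chosen for illustration. Hadoop's CombineFileInputFormat is one class that packs files this way, although the abstract does not name the formats that were compared.

```python
from typing import List


def one_split_per_file(file_sizes: List[int]) -> List[List[int]]:
    """Model an input format that never combines files: one split per file."""
    return [[size] for size in file_sizes]


def pack_into_splits(file_sizes: List[int], max_split_bytes: int) -> List[List[int]]:
    """Greedily pack files, in order, into splits of at most max_split_bytes.

    A file larger than the limit still gets a split of its own; this sketch
    does not model splitting a single large file into several splits.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in file_sizes:
        # Close the current split if adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Hypothetical workload: 1000 files of 4 KiB each, 64 MiB split limit.
small_files = [4 * 1024] * 1000
per_file = one_split_per_file(small_files)
packed = pack_into_splits(small_files, 64 * 1024 * 1024)

# Each split corresponds to one map task in this model.
print(len(per_file), len(packed))  # prints: 1000 1
```

Under these assumptions the same data needs 1000 map tasks without packing and a single map task with packing, so the fixed per-task startup cost is paid once rather than 1000 times.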