Addressing the Small Files Issue in Hadoop

Saranya Bangarusamy, Sonam Yadav, Mohammed Mahmoud, Sindhura Gali
DOI: https://doi.org/10.1109/CSCI54926.2021.00136
2021-12-01
Abstract: With the advancement of cloud technologies, Distributed File Systems (DFSs) are receiving more attention. The Hadoop Distributed File System (HDFS), which is based on the MapReduce framework, has recently become the most popular DFS for storing very large-scale data. However, it has various shortcomings, including its handling of small-file metadata, its I/O performance, and its security. In this paper, we address HDFS's difficulty in storing small files, and we discuss and evaluate the possible solutions. Our analysis of these solutions leads to the conclusion that more robust cloud options should be used in industry for storing very large-scale data, in terms of scalability and cost. The main goal of this paper is to explore strategies for overcoming the problems of handling small files in HDFS, a system known for processing large datasets quickly, and to discuss the reliability of its alternatives (AWS S3, DynamoDB, Azure Data Lake, Azure Blob Storage) by comparing their performance and security measures.
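The small-file metadata shortcoming mentioned in the abstract can be illustrated with a rough calculation. The sketch below is not taken from the paper. It assumes the commonly quoted rule of thumb that every file and every block costs roughly 150 bytes of NameNode heap, and a 128 MiB block size (the default `dfs.blocksize` in recent Hadoop releases). The function names are hypothetical and exist only for this illustration.

```python
import math

# Assumption: rule-of-thumb heap cost per namespace object (file or block).
BYTES_PER_OBJECT = 150
# Assumption: default HDFS block size (dfs.blocksize) of 128 MiB.
BLOCK_SIZE = 128 * 1024 ** 2


def namenode_objects(file_count, file_size_bytes, block_size=BLOCK_SIZE):
    """Count namespace objects: one per file plus one per block of each file."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size))
    return file_count * (1 + blocks_per_file)


def namenode_memory_bytes(file_count, file_size_bytes,
                          block_size=BLOCK_SIZE,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap used by the metadata of these files."""
    return namenode_objects(file_count, file_size_bytes, block_size) * bytes_per_object


# Store the same 1 TiB of data as 1 MiB files and as 128 MiB files.
TOTAL_BYTES = 1024 ** 4
for size in (1024 ** 2, 128 * 1024 ** 2):
    count = TOTAL_BYTES // size
    heap = namenode_memory_bytes(count, size)
    print(f"{count:>9} files of {size // 1024 ** 2:>3} MiB: "
          f"~{heap / 1024 ** 2:.2f} MiB of NameNode heap")
```

Under these assumptions the same volume of data needs 128 times more NameNode memory when it is stored as 1 MiB files instead of block-sized files. This is the kind of overhead that file-merging approaches and object stores aim to avoid.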
Computer Science, Engineering