Abstract: Storing and handling large datasets in varied forms is a challenging task. Data is being produced rapidly, and much of it consists of small files. Hadoop is a solution for the big data problem, but it has a few limitations. The method proposed here aims to improve storage, access efficiency, and access time for small files. In contrast to current methods such as HDFS sequence files, HAR, and NHAR, a new strategy called the VFS-HDFS architecture is designed to optimize access to small files. In HDFS, when a user requests a file, the client contacts the NameNode, which responds with the file's metadata. The metadata contains information about the file's blocks and their locations. Once the client has this metadata, it communicates with the DataNodes and accesses the data sequentially. The proposed work introduces caching to store all the files. When a user requests an existing file, the data is retrieved from the cache itself, avoiding a visit to the NameNode followed by the DataNodes, which reduces access time and improves access efficiency. Classification groups the files by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed design wraps the existing HDFS architecture with a virtual file system layer and leaves the HDFS architecture itself unchanged. With the proposed system, better results are obtained for the access efficiency of small files in HDFS. A case study is performed on the British Library datasets, using .txt and .rtf files. The proposed system can be used to enhance a library when the catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
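The read path described above can be illustrated with a minimal sketch. It is not the VFS-HDFS implementation: the class name `VirtualFileLayer`, the `backend_read` callable standing in for the NameNode and DataNode round trip, and the rule of classifying files by extension are all assumptions made for illustration.

```python
import os
from typing import Callable, Dict, List


def category_of(path: str) -> str:
    """Assumed classification rule: group files by lowercase extension."""
    ext = os.path.splitext(path)[1].lower()
    return ext.lstrip(".") or "unknown"


class VirtualFileLayer:
    """Hypothetical caching layer wrapped around an unchanged backend."""

    def __init__(self, backend_read: Callable[[str], bytes]) -> None:
        # backend_read stands in for the NameNode lookup plus DataNode reads.
        self._backend_read = backend_read
        # One bucket per category; each bucket maps a file path to its bytes.
        self._buckets: Dict[str, Dict[str, bytes]] = {}
        self.backend_reads = 0

    def read(self, path: str) -> bytes:
        bucket = self._buckets.setdefault(category_of(path), {})
        if path in bucket:
            # Cache hit: served from memory, the backend is not contacted.
            return bucket[path]
        # Cache miss: fall through to the backend once, then keep the data.
        self.backend_reads += 1
        data = self._backend_read(path)
        bucket[path] = data
        return data

    def bucket_table(self) -> Dict[str, List[str]]:
        """Per-category listing of the cached file paths."""
        return {cat: sorted(files) for cat, files in self._buckets.items()}


if __name__ == "__main__":
    # A plain dict plays the role of the HDFS cluster in this sketch.
    store = {"/lib/a.txt": b"alpha", "/lib/b.rtf": b"beta"}
    layer = VirtualFileLayer(store.__getitem__)
    layer.read("/lib/a.txt")
    layer.read("/lib/a.txt")  # second request is answered from the cache
    layer.read("/lib/b.rtf")
    print(layer.backend_reads, layer.bucket_table())
```

In this sketch every file that has been read stays in memory, which mirrors the trade-off stated above: repeated requests skip the backend entirely, at the cost of memory.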
