Abstract: Storing and handling large datasets in varied forms is a challenging task, and data is now produced rapidly, much of it as small files. Hadoop addresses the big data problem, but with a few limitations. This work proposes an approach that improves on existing methods for small files in terms of storage, access efficiency, and time. In contrast to current methods such as HDFS sequence files, HAR, and NHAR, a novel architecture called VFS-HDFS is designed to optimize access to small files. In HDFS, when a user requests a file, the client contacts the NameNode, which responds with the file's metadata. The metadata describes the file's blocks and their locations. With this metadata, the client communicates with the DataNodes and reads the data sequentially. The proposed work introduces a cache that stores all the files. When a user requests an existing file, the data is retrieved from the cache itself, avoiding a visit to the NameNode followed by the DataNodes, which reduces access time and improves access efficiency. Files are classified by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed design wraps the existing HDFS architecture in a virtual file system layer, without changing the HDFS architecture itself. Using the proposed system, better results are obtained for the access efficiency of small files in HDFS. A case study is performed on .txt and .rtf files from the British Library datasets. The proposed system can be used to enhance a library whose catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
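The access path described in the abstract can be illustrated with a minimal sketch. It is not the VFS-HDFS implementation: `FakeHdfs`, `VirtualFileSystem`, and `category_of` are hypothetical names, the backend is an in-memory stand-in that only counts NameNode and DataNode round trips, the cache is assumed to fill on first read, and files are assumed to be classified by extension.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class FakeHdfs:
    """Hypothetical stand-in for HDFS; counts NameNode and DataNode round trips."""

    def __init__(self, files):
        self._files = dict(files)  # path -> bytes
        self.namenode_lookups = 0
        self.datanode_reads = 0

    def get_metadata(self, path):
        # Models the client asking the NameNode for block information.
        self.namenode_lookups += 1
        if path not in self._files:
            raise FileNotFoundError(path)
        return {"path": path, "length": len(self._files[path])}

    def read_blocks(self, metadata):
        # Models the client reading the file's blocks from the DataNodes.
        self.datanode_reads += 1
        return self._files[metadata["path"]]


class VirtualFileSystem:
    """Cache plus per-category file table layered over an unmodified backend."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}                   # path -> file contents
        self.buckets = defaultdict(dict)  # category -> {path: metadata}

    @staticmethod
    def category_of(path):
        # Assumption: the category is the file extension, e.g. "txt" or "rtf".
        return PurePosixPath(path).suffix.lstrip(".").lower() or "other"

    def read(self, path):
        # Cache hit: no NameNode or DataNode round trip is made.
        if path in self.cache:
            return self.cache[path]
        # Cache miss: follow the usual HDFS path, then remember the result.
        metadata = self.backend.get_metadata(path)
        data = self.backend.read_blocks(metadata)
        self.cache[path] = data
        self.buckets[self.category_of(path)][path] = metadata
        return data
```

Reading the same path twice through `VirtualFileSystem.read` reaches the backend only once; the second read is served from `cache`. The backend class is left untouched, mirroring the abstract's point that the layer wraps HDFS rather than modifying it, and the cache trades memory for fewer round trips.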
