Abstract: Storing and handling large datasets in varied forms is a challenging task. Data is being produced rapidly, and much of it consists of small files. Hadoop is a solution to the big data problem, apart from a few limitations. This work proposes an approach that improves the handling of small files in terms of storage, access efficiency, and time. In contrast to existing methods such as HDFS sequence files, HAR, and NHAR, a new architecture called VFS-HDFS is designed with the goal of optimizing access to small files. In HDFS, when a user requests a file, the client contacts the NameNode, and the NameNode responds with the file's metadata, which describes the file's blocks and their locations. Once the client has this metadata, it communicates with the DataNodes and reads the data sequentially. The proposed work introduces caching to store all the files. When a user requests an existing file, the data is retrieved from the cache itself, avoiding a visit to the NameNode followed by the DataNodes, which reduces access time and improves access efficiency. Classification is used to group files by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed design wraps the existing HDFS architecture in a virtual file system layer, without changing the HDFS architecture itself. With the proposed system, better results are obtained in terms of access efficiency for small files in HDFS. A case study is performed on the British Library datasets of .txt and .rtf files. The proposed system can be used to enhance a library whose catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
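The read path described in the abstract can be illustrated with a minimal sketch. It is not the authors' VFS-HDFS implementation: the class name `CategoryContainerStore`, the `backend_read` callable (a stand-in for the NameNode lookup followed by DataNode reads), and the choice of file extension as the category are all assumptions made for illustration. The sketch keeps one in-memory container per category and a per-category file table mapping each path to its offset and length inside that container.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class CategoryContainerStore:
    """Toy stand-in for a VFS layer: one container and one file table per category."""

    def __init__(self, backend_read):
        # backend_read(path) -> bytes is a hypothetical callable that stands in
        # for asking the NameNode for metadata and then reading from DataNodes.
        self._backend_read = backend_read
        self._containers = defaultdict(bytearray)  # category -> merged file bytes
        self._file_tables = defaultdict(dict)      # category -> {path: (offset, length)}
        self.backend_calls = 0                     # counts simulated HDFS round trips

    @staticmethod
    def classify(path):
        # Assumption: the category is the lowercased file extension.
        return PurePosixPath(path).suffix.lower().lstrip(".") or "other"

    def read(self, path):
        category = self.classify(path)
        entry = self._file_tables[category].get(path)
        if entry is None:
            # Cache miss: fetch once from the backend, append the bytes to the
            # category container, and record where they landed.
            data = self._backend_read(path)
            self.backend_calls += 1
            container = self._containers[category]
            entry = (len(container), len(data))
            container.extend(data)
            self._file_tables[category][path] = entry
        offset, length = entry
        # Cache hit (or freshly cached): serve the bytes from the container.
        return bytes(self._containers[category][offset:offset + length])


# A dictionary plays the role of HDFS in this sketch.
fake_hdfs = {
    "/lib/a.txt": b"alpha",
    "/lib/b.rtf": b"{\\rtf1 beta}",
    "/lib/c.txt": b"gamma",
}
store = CategoryContainerStore(fake_hdfs.__getitem__)
```

Repeated reads of the same path touch the backend only once; the trade-off, as the abstract notes, is the memory held by the containers.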


Small files access efficiency in hadoop distributed file system a case study performed on British library text files
