Small files access efficiency in Hadoop distributed file system a case study performed on British Library text files

Alange, Neeta
DOI: https://doi.org/10.1007/s10586-023-03992-1
2023-04-08
Cluster Computing
Abstract: Storing and handling large datasets in a variety of forms is a challenging task, and data is now produced rapidly, much of it as small files. Hadoop addresses the big data problem, apart from a few limitations. This work proposes an approach that improves storage, access efficiency, and access time for small files. In contrast to existing methods such as HDFS sequence files, HAR, and NHAR, a new architecture called VFS-HDFS is designed to address the small-file access problem. In HDFS, when a user requests a file, the client contacts the NameNode, which responds with the file's metadata, that is, its blocks and their locations. The client then uses this metadata to contact the DataNodes and reads the data sequentially. The proposed work introduces a cache that stores all the files. When a user requests an existing file, the data is retrieved from the cache itself, which avoids revisiting the NameNode and then the DataNodes and so reduces access time and improves access efficiency. Files are classified by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed design wraps the existing HDFS architecture in a virtual file system layer, so the HDFS architecture itself is left unchanged. With the proposed system, better access efficiency is obtained for small files in HDFS. A case study is performed on British Library datasets of .txt and .rtf files. The proposed system can be used to enhance a library whose catalogue is organized into per-category containers, reducing storage and improving access efficiency at the cost of memory.
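The cache-first read path described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's VFS-HDFS implementation: the class name `SmallFileCache`, the `fetch_from_hdfs` callback (standing in for the NameNode and DataNode round trip), the LRU eviction policy, and classification by file extension are all assumptions made for illustration, and the bucket table is simplified to a map from category to cached paths.

```python
from collections import OrderedDict
from typing import Callable, Dict, Set


class SmallFileCache:
    """Hypothetical cache layer in front of a simulated HDFS read path."""

    def __init__(self, fetch_from_hdfs: Callable[[str], bytes], capacity: int = 1024):
        # fetch_from_hdfs stands in for the NameNode lookup plus DataNode reads.
        self._fetch = fetch_from_hdfs
        self._capacity = capacity
        self._cache: "OrderedDict[str, bytes]" = OrderedDict()
        # Simplified bucket-per-category table: category -> cached file paths.
        self.buckets: Dict[str, Set[str]] = {}
        self.hdfs_reads = 0  # counts how often the slow path was taken

    @staticmethod
    def classify(path: str) -> str:
        # Assumption: the category is the lowercase file extension.
        name = path.rsplit("/", 1)[-1]
        return name.rsplit(".", 1)[-1].lower() if "." in name else "other"

    def read(self, path: str) -> bytes:
        if path in self._cache:
            # Cache hit: no NameNode or DataNode access is needed.
            self._cache.move_to_end(path)
            return self._cache[path]
        # Cache miss: fall back to the regular HDFS read path.
        data = self._fetch(path)
        self.hdfs_reads += 1
        self._cache[path] = data
        self.buckets.setdefault(self.classify(path), set()).add(path)
        if len(self._cache) > self._capacity:
            # Assumption: evict the least recently used file when full.
            evicted, _ = self._cache.popitem(last=False)
            self.buckets[self.classify(evicted)].discard(evicted)
        return data
```

In this sketch a repeated `read` of the same path increments `hdfs_reads` only once, which mirrors the abstract's claim that cached requests skip the NameNode and DataNodes. The memory held by `_cache` is the cost the abstract refers to.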
computer science, information systems, theory & methods