Abstract: With smart, Internet-connected objects being deployed across a broad range of applications, the amount of data moving across different geographical regions is increasing correspondingly. This, in turn, poses a number of challenges for big data storage across multiple locations, including cloud computing platforms. For example, the underlying distributed file system holds a large number of directories and files in the form of gigantic trees, which are difficult to parse in polynomial time. Moreover, with the exponential increase in big data streams (i.e., unbounded sets of continuous data flows), the challenges associated with indexing and membership queries are compounded. The capability to process such a significant amount of data with high accuracy can have a significant impact on decision-making and on the formulation of business and risk-related strategies, particularly in the current Industrial Internet of Things (IIoT) environment. However, existing storage solutions are deterministic in nature. In other words, they tend to consume considerable memory and CPU time to yield accurate results. This necessitates the design of efficient, quality of service-aware IIoT applications that can deal with the challenges of data storage and retrieval in the cloud computing environment. In this paper, we present a space-efficient strategy for massive data storage using the Bloom filter (BF). Specifically, in the proposed scheme, the standard BF is extended to incorporate a fuzzy-enabled folding approach, hereafter referred to as the fuzzy folded BF (FFBF). In FFBF, fuzzy operations are used to accommodate the hashed data of one BF into another to reduce storage requirements. Evaluations on the UCI ML AReM and Facebook datasets demonstrate the efficacy of FFBF, which handles approximately 1.9 times more data than the standard BF without affecting the false positive rate or query time.
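
To make the folding idea concrete, the following is a minimal sketch and not the authors' FFBF algorithm, whose fuzzy operations the abstract does not specify. It assumes a standard Bloom filter with double hashing over SHA-256, and it assumes that "folding" can be illustrated by the element-wise maximum of two bit arrays (the standard fuzzy union, which reduces to bitwise OR on 0/1 values). The names `BloomFilter` and `fold` and the parameters `m` and `k` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing); this hashing scheme is an assumption.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))


def fold(bf_a, bf_b):
    """Accommodate two equally shaped filters in one bit array.

    Uses the element-wise max (fuzzy union); an illustrative stand-in
    for the paper's fuzzy operations, not a reproduction of them.
    """
    if (bf_a.m, bf_a.k) != (bf_b.m, bf_b.k):
        raise ValueError("filters must share m and k to be folded")
    merged = BloomFilter(bf_a.m, bf_a.k)
    merged.bits = [max(a, b) for a, b in zip(bf_a.bits, bf_b.bits)]
    return merged


if __name__ == "__main__":
    a, b = BloomFilter(1024, 3), BloomFilter(1024, 3)
    a.add("sensor-1")
    b.add("sensor-2")
    merged = fold(a, b)
    print("sensor-1" in merged, "sensor-2" in merged)
```

In this plain sketch the merged filter answers membership queries for both input sets using the storage of one filter, but it sets more bits than either input, so its false positive rate can only stay the same or grow. The abstract reports that FFBF avoids this penalty; how it does so is not something this sketch reproduces.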

Fuzzy-Folded Bloom Filter-as-a-Service for Big Data Storage in the Cloud
