Using fuzzy reasoning to improve redundancy elimination for data deduplication in connected environments

Sylvana Yakhni, Joe Tekli, Elio Mansour, Richard Chbeir
DOI: https://doi.org/10.1007/s00500-023-07880-z
IF: 3.732
2023-03-16
Soft Computing
Abstract: The Internet of Things is ushering in the era of connected environments, where the growing number and diversity of data sources (devices and sensors) are inevitably increasing the size of the data that need to be stored locally (at the edge device level) and transmitted to base storages (at the sink level) of the network. This huge amount of data raises several challenges, including network bandwidth, network energy consumption, cloud storage, and I/O throughput. These call for data pre-processing and filtering solutions to reduce the amount of data being handled and transmitted over the network. In this study, we investigate data deduplication as a prominent pre-processing method that can be used and adapted to address such challenges. Data deduplication techniques have traditionally been developed for data storage and data warehousing applications, and aim at identifying and eliminating redundant data items. A few recent approaches have been designed for connected environments, yet they share various limitations, including: (i) detecting duplicates at only one level of the network (either edge or sink exclusively), (ii) overlooking the context and dynamicity of the network (disregarding device mobility and overlooking boundary separations and sensor coverage areas), and (iii) relying on crisp thresholds and providing minimal to no expert control over the deduplication process (disregarding the domain expert’s needs in defining redundancy). In this study, we propose FREDD, a new approach for Fuzzy Redundancy Elimination for Data Deduplication in a connected environment. FREDD uses simple natural language rules to represent domain knowledge and expert preferences regarding data duplication boundaries. It then applies pattern codes and fuzzy reasoning to detect duplicates at both the edge level and the sink level of the network. This reduces the time required to hard-code the deduplication process, while adapting to the domain expert’s needs for different data sources and applications. Moreover, FREDD is adapted for multiple scenarios, considering both static and mobile devices, with different configurations of hard-separated and soft-separated zones, and different sensor coverage areas in the connected environment. Experiments on a real-world dataset highlight FREDD’s potential and improvement compared with existing solutions.
computer science, artificial intelligence, interdisciplinary applications