Abstract:

Background: A major problem arising from searching across bibliographic databases is the retrieval of duplicate citations. Removing such duplicates is an essential task to ensure systematic reviewers do not waste time screening the same citation multiple times. Although reference management software uses algorithms to remove duplicate records, this is only partially successful, and the remaining duplicates must be removed manually. This time-consuming task wastes resources. We sought to evaluate the effectiveness of a newly developed deduplication program against EndNote.

Methods: A literature search of 1,988 citations was manually inspected, and duplicate citations were identified and coded to create a benchmark dataset. The Systematic Review Assistant-Deduplication Module (SRA-DM) was iteratively developed and tested using the benchmark dataset and compared with EndNote’s default one-step auto-deduplication process, which matches on (‘author’, ‘year’, ‘title’). The accuracy of deduplication was reported by calculating the sensitivity and specificity. Further validation tests, with three additional benchmarked literature searches comprising a total of 4,563 citations, were performed to determine the reliability of the SRA-DM algorithm.

Results: The sensitivity (84%) and specificity (100%) of SRA-DM were superior to those of EndNote (sensitivity 51%, specificity 99.83%). Validation testing on three additional biomedical literature searches demonstrated that SRA-DM consistently achieved higher sensitivity than EndNote (90% vs 63%, 84% vs 73%, and 84% vs 64%). Furthermore, the specificity of SRA-DM was 100%, whereas the specificity of EndNote was imperfect (average 99.75%), with some unique records wrongly assigned as duplicates. Overall, there was a 42.86% increase in the number of duplicate records detected with SRA-DM compared with EndNote auto-deduplication.

Conclusions: The Systematic Review Assistant-Deduplication Module offers users a reliable program to remove duplicate records with greater sensitivity and specificity than EndNote. This application will save researchers and information specialists time and avoid research waste. The deduplication program is freely available online.
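
The accuracy measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the SRA-DM implementation. The function names are hypothetical, and the counts in the usage lines are invented for illustration rather than taken from the study. A "positive" here means a record that the benchmark coded as a duplicate.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of benchmark duplicates that the tool flagged as duplicates."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unique records that the tool correctly left alone."""
    return true_neg / (true_neg + false_pos)


def relative_increase(new_count: int, old_count: int) -> float:
    """Percentage increase of new_count over old_count."""
    return (new_count - old_count) / old_count * 100


# Illustrative counts only (assumed, not study data).
print(sensitivity(true_pos=84, false_neg=16))
print(specificity(true_neg=900, false_pos=0))
print(relative_increase(new_count=150, old_count=100))
```

A specificity below 100% means that at least one unique record was flagged as a duplicate. In a systematic review this is the costlier error, because a relevant citation may be discarded before screening.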

Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module
