GradHC: Highly Reliable Gradual Hash-based Clustering for DNA Storage Systems

Dvir Ben Shabat, Adar Hadad, Avital Boruchovsky, Eitan Yaakobi
DOI: https://doi.org/10.1093/bioinformatics/btae274
IF: 5.8
2024-04-22
Bioinformatics
Abstract:
Motivation: As data storage challenges grow and existing technologies approach their limits, synthetic DNA emerges as a promising storage solution due to its remarkable density and durability advantages. While cost remains a concern, emerging sequencing and synthetic technologies aim to mitigate it, yet introduce challenges such as errors in the storage and retrieval process. One crucial task in a DNA storage system is clustering numerous DNA reads into groups that represent the original input strands.
Results: In this paper, we review different methods for evaluating clustering algorithms and introduce a novel clustering algorithm for DNA storage systems, named Gradual Hash-based clustering (GradHC). The primary strength of GradHC lies in its capability to cluster with excellent accuracy various types of designs, including varying strand lengths, cluster sizes (including extremely small clusters), and different error ranges. Benchmark analysis demonstrates that GradHC is significantly more stable and robust than other clustering algorithms previously proposed for DNA storage, while also producing highly reliable clustering results.
Availability and implementation: https://github.com/bensdvir/GradHC
Supplementary information: Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses a key task in DNA storage systems: clustering DNA reads efficiently and accurately. Synthetic DNA offers remarkable density and durability, but cost remains a concern, and the storage and retrieval process is prone to errors. A clustering algorithm must group a large number of DNA reads into clusters that represent the original input strands. The paper proposes a new clustering algorithm, GradHC (Gradual Hash-based Clustering), which handles designs with varying strand lengths, cluster sizes, and error ranges, and which benchmark tests show to be more stable and reliable than existing methods.

The GradHC algorithm consists of three steps. First, the data is coarsely partitioned into blocks to reduce cost. Second, within each block, locality-sensitive hashing (LSH) based on q-grams, together with the Sørensen-Dice coefficient, is used for further clustering. Third, in the global clustering phase, all blocks are traversed, representatives are selected, and similar steps are applied to merge clusters from different blocks. This approach is designed to cope with the insertion, deletion, and substitution errors that can arise during synthesis, PCR, and sequencing.

The paper also uses two metrics to evaluate clustering algorithms, Threat Score (TS) and Accuracy, and compares GradHC with other existing algorithms on them. On these metrics, GradHC outperforms the compared algorithms. In addition, the paper discusses parameter selection and the time complexity of the algorithm.
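To make the similarity measure used in the second step more concrete, here is a minimal sketch of a Sørensen-Dice coefficient computed over q-gram sets. It is an illustration only, not the GradHC implementation: the function names `qgrams` and `dice_coefficient`, the default `q = 3`, and the choice to compare sets rather than multisets of q-grams are all assumptions made for this example. The hashing, block partitioning, and merging steps are not shown.

```python
def qgrams(seq: str, q: int = 3) -> set[str]:
    """Return the set of all length-q substrings (q-grams) of seq.

    Hypothetical helper for illustration; q = 3 is an arbitrary default.
    """
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def dice_coefficient(a: str, b: str, q: int = 3) -> float:
    """Sørensen-Dice coefficient of the q-gram sets of a and b.

    Defined as 2 * |A ∩ B| / (|A| + |B|), so it lies between 0 and 1.
    Two reads that are both shorter than q have empty q-gram sets;
    this sketch treats that case as identical (an assumption).
    """
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    original = "ACGTACGTAC"
    noisy = "ACGTACTTAC"  # one substitution relative to `original`
    unrelated = "TTTTGGGGCC"
    print(dice_coefficient(original, noisy))
    print(dice_coefficient(original, unrelated))
```

Reads that originate from the same strand share most of their q-grams even after a few insertions, deletions, or substitutions, so their coefficient stays high, while reads from unrelated strands tend to score close to 0. A clustering step could then compare this score against a threshold, although the thresholds and parameters GradHC actually uses are not given in this summary.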