Challenges for error-correction coding in DNA data storage: photolithographic synthesis and DNA decay

Andreas L Gimpel, Wendelin J. Stark, Reinhard Heckel, Robert N Grass
DOI: https://doi.org/10.1101/2024.07.04.602085
2024-07-04
Abstract: Efficient error-correction codes are crucial for realizing DNA's potential as a long-lasting, high-density storage medium for digital data. At the same time, new workflows promising low-cost, resilient DNA data storage are challenging their design and error-correcting capabilities. This study characterizes the errors and biases in two new additions to the state-of-the-art workflow in DNA data storage: photolithographic synthesis and DNA decay. Photolithographic synthesis offers low-cost, scalable oligonucleotide synthesis but suffers from high error rates, necessitating sophisticated error-correction schemes, for example codes introducing within-sequence redundancy combined with clustering and alignment techniques for retrieval. On the other hand, the decoding of oligo fragments after DNA decay promises unprecedented storage densities, but complicates data recovery by requiring the reassembly of full-length sequences or the use of partial sequences for decoding. Our analysis provides a detailed account of the error patterns and biases present in photolithographic synthesis and DNA decay, and identifies considerable bias stemming from sequencing workflows. We implement our findings into a digital twin of the two workflows, offering a tool for developing error-correction codes and providing benchmarks for the evaluation of codec performance.
Biochemistry
What problem does this paper attempt to address?
The paper addresses two key challenges in DNA data storage: the errors and biases introduced by photolithographic synthesis and by DNA decay.

1. **Photolithographic Synthesis**: Photolithographic synthesis is a low-cost, scalable method for oligonucleotide synthesis, but its high error rate necessitates complex error-correction schemes for accurate data recovery. Despite the high physical redundancy and sequencing depth available with photolithographic synthesis, the error rate remains high, with deletion errors particularly prominent. Using this redundancy effectively to generate consensus sequences with fewer errors is therefore an important problem.
2. **DNA Decay**: During long-term storage, DNA decays, leaving a large number of short fragments rather than complete oligonucleotide sequences. Although the error rate after DNA decay is relatively low, the fragmented sequences make decoding difficult. The breakpoints produced by decay are not uniformly distributed but are biased towards specific locations, which further complicates data recovery.

The paper analyzes existing sequencing data to detail the error patterns and biases present in these two processes and proposes corresponding solutions, including optimizing experimental procedures and developing new error-correction coding techniques. Additionally, the study establishes a digital twin of the two workflows to test and evaluate the performance of different error-correction codes.
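
To make the two error processes concrete, the following is a minimal toy simulation in Python. It is not the paper's digital twin: the function names `synthesize` and `decay`, the error rates, and the assumption of independent, position-uniform errors and breaks are all illustrative placeholders. In particular, the summary above notes that real breakpoints are biased towards specific locations, which this sketch deliberately ignores.

```python
import random

BASES = "ACGT"

def synthesize(seq, p_del, p_sub, rng):
    """Noisy copy of seq: each base is independently deleted or substituted.

    p_del and p_sub are hypothetical rates, not values from the paper.
    """
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: the base is dropped from the copy
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def decay(seq, p_break, rng):
    """Cut seq between adjacent bases with probability p_break per bond.

    Uniform breakage is a simplifying assumption made for this sketch.
    """
    fragments, start = [], 0
    for i in range(1, len(seq)):
        if rng.random() < p_break:
            fragments.append(seq[start:i])
            start = i
    fragments.append(seq[start:])
    return fragments

rng = random.Random(0)
reference = "".join(rng.choice(BASES) for _ in range(60))

# Several noisy reads of the same oligo (deletion-dominated errors).
reads = [synthesize(reference, p_del=0.05, p_sub=0.01, rng=rng) for _ in range(5)]

# Fragments of an error-free oligo after simulated decay.
fragments = decay(reference, p_break=0.05, rng=rng)

print([len(r) for r in reads])
print([len(f) for f in fragments])
```

Because deletions shift every downstream base, the noisy reads cannot be combined by a simple position-wise majority vote, which is why clustering and alignment are needed before a consensus is formed. The fragments, by contrast, are individually error-free in this toy model but must be reassembled or decoded as partial sequences.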