Abstract: With the world generating digital data at an exponential rate, DNA has emerged as a promising archival medium. It offers a more efficient and long-lasting digital storage solution due to its durability, physical density, and high information capacity. Research in the field includes the development of encoding schemes that are compatible with existing DNA synthesis and sequencing technologies. Recent studies suggest leveraging the inherent information redundancy of these technologies by using composite DNA alphabets. A major challenge in this approach is the noisy inference process, which has prevented the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter represents a subset of shortmers. The nature of these combinatorial alphabets minimizes mix-up errors while also ensuring the robustness of the system. We formally define various combinatorial encoding schemes and investigate their theoretical properties, such as information density, reconstruction probabilities, and required synthesis and sequencing multiplicities. We then suggest an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional error correction codes, and reconstruction algorithms. Using simulations, we demonstrate our suggested approach and evaluate different combinatorial alphabets for encoding 10 KB messages under different error regimes. The simulations reveal vital insights, including that nucleotide substitution errors are more manageable than shortmer-level insertions and deletions.
Sequencing coverage was found to be a key factor affecting system performance, and the use of two-dimensional Reed-Solomon (RS) error correction significantly improved reconstruction rates. Our experimental proof of concept validates the feasibility of our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach to different error types. Subsampling experiments supported the important role of the sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage, while raising theoretical research questions and technical challenges. These include the development of error correction codes for combinatorial DNA, the exploration of optimal sampling rates, and the advancement of DNA synthesis technologies that support combinatorial synthesis. Combining combinatorial principles with error-correcting strategies paves the way for efficient, error-resilient DNA-based storage solutions.
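
The idea of a combinatorial alphabet, in which each letter is a subset of shortmers, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the choice of 16 shortmers with 5 per letter, and the "take the k most frequent shortmers" inference rule are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def build_alphabet(n_shortmers: int, k: int) -> list[tuple[int, ...]]:
    """Enumerate every letter as a k-subset of shortmer indices 0..n_shortmers-1."""
    return list(combinations(range(n_shortmers), k))


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Information carried by one letter position: log2 of the alphabet size."""
    return log2(comb(n_shortmers, k))


def infer_letter(observed: list[int], k: int) -> tuple[int, ...]:
    """Guess the letter at one position from sequenced shortmer indices.

    Hypothetical rule: keep the k most frequently observed shortmers.
    Rare shortmers (for example, from substitution errors) are ignored.
    """
    counts = Counter(observed)
    top = [shortmer for shortmer, _ in counts.most_common(k)]
    return tuple(sorted(top))


if __name__ == "__main__":
    # Illustrative parameters only; not taken from the abstract.
    n, k = 16, 5
    alphabet = build_alphabet(n, k)
    print(len(alphabet))                  # 4368
    print(f"{bits_per_letter(n, k):.2f}")  # 12.09

    # Reads covering one position: shortmers 0, 3, 7 dominate; 9 is noise.
    reads = [0] * 10 + [3] * 9 + [7] * 8 + [9]
    print(infer_letter(reads, 3))         # (0, 3, 7)
```

Under these assumed parameters a single letter position carries about 12 bits, compared with 2 bits for one of four nucleotides. The inference step also hints at why sequencing coverage matters: with too few reads per position, a true member of the subset may be missed or outvoted by noise.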

Efficient DNA-based data storage using shortmer combinatorial encoding