Abstract: With the world generating digital data at an exponential rate, DNA has emerged as a promising archival medium. It offers a more efficient and long-lasting digital storage solution due to its durability, physical density, and high information capacity. Research in the field includes the development of encoding schemes that are compatible with existing DNA synthesis and sequencing technologies. Recent studies suggest leveraging the inherent information redundancy of these technologies by using composite DNA alphabets. A major challenge in this approach is the noisy inference process, which has prevented the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter represents a subset of shortmers. The nature of these combinatorial alphabets minimizes mix-up errors while also ensuring the robustness of the system. We formally define various combinatorial encoding schemes and investigate their theoretical properties, such as information density, reconstruction probabilities, and required synthesis and sequencing multiplicities. We then suggest an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional error correction codes, and reconstruction algorithms. Using simulations, we demonstrate our suggested approach and evaluate different combinatorial alphabets for encoding 10 KB messages under different error regimes. The simulations show that nucleotide substitution errors are more manageable than shortmer-level insertions and deletions. Sequencing coverage was found to be a key factor affecting system performance, and the use of two-dimensional Reed-Solomon (RS) error correction significantly improved reconstruction rates. Our experimental proof-of-concept validates the feasibility of our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach for different error types. Subsampling experiments supported the important role of sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage, while raising theoretical research questions and technical challenges. These include the development of error correction codes for combinatorial DNA, the exploration of optimal sampling rates, and the advancement of DNA synthesis technologies that support combinatorial synthesis. Combining combinatorial principles with error-correcting strategies paves the way for efficient, error-resilient DNA-based storage solutions.
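
The idea of a combinatorial alphabet can be illustrated with a minimal sketch. It assumes the simplest scheme, in which each letter is a subset of exactly k shortmers drawn from a set of N, so a letter carries log2(C(N, k)) bits. The shortmer sequences, the values of N and k, and the function names below are hypothetical and chosen only for illustration; the top-k inference rule is likewise a simplifying assumption, not necessarily the reconstruction algorithm evaluated in this work.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Information carried by one letter when a letter is a k-subset of N shortmers."""
    return log2(comb(n_shortmers, k))


def build_alphabet(shortmers, k):
    """Enumerate every k-subset of the shortmer set; each subset is one letter."""
    return [frozenset(subset) for subset in combinations(shortmers, k)]


def infer_letter(observed_shortmers, k):
    """Assumed inference rule: keep the k shortmers seen most often at one position."""
    counts = Counter(observed_shortmers)
    return frozenset(shortmer for shortmer, _ in counts.most_common(k))


# Hypothetical toy set of four shortmers (not taken from the paper).
shortmers = ["AAC", "CGT", "GTA", "TCG"]
alphabet = build_alphabet(shortmers, k=2)
print(len(alphabet))  # 6 letters, since C(4, 2) = 6

# Illustrative larger parameters, compared with 2 bits for a single nucleotide.
print(round(bits_per_letter(16, 5), 2), "bits per letter for N=16, k=5")

# Reads covering one position: two true shortmers plus one erroneous observation.
reads = ["AAC", "GTA", "AAC", "GTA", "TCG", "AAC", "GTA"]
print(infer_letter(reads, k=2) == frozenset({"AAC", "GTA"}))
```

With few reads per position, an erroneous shortmer can outnumber a true one, which is one way to see why sequencing coverage affects reconstruction in such a scheme.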
