Abstract: With the world generating digital data at an exponential rate, DNA has emerged as a promising archival medium. It offers a more efficient and long-lasting digital storage solution due to its durability, physical density, and high information capacity. Research in the field includes the development of encoding schemes that are compatible with existing DNA synthesis and sequencing technologies. Recent studies suggest leveraging the inherent information redundancy of these technologies by using composite DNA alphabets. A major challenge in this approach is the noisy inference process, which has prevented the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter represents a subset of shortmers. The nature of these combinatorial alphabets minimizes mix-up errors while also ensuring the robustness of the system. We formally define various combinatorial encoding schemes and investigate their theoretical properties, such as information density, reconstruction probabilities, and required synthesis and sequencing multiplicities. We then suggest an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional error correction codes, and reconstruction algorithms. Using simulations, we demonstrate our suggested approach and evaluate different combinatorial alphabets for encoding 10 KB messages under different error regimes. The simulations reveal vital insights, including the relative manageability of nucleotide substitution errors compared with shortmer-level insertions and deletions. Sequencing coverage was found to be a key factor affecting system performance, and the use of two-dimensional Reed-Solomon (RS) error correction significantly improved reconstruction rates. Our experimental proof of concept validates the feasibility of our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach to different error types. Subsampling experiments supported the important role of sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage, while raising theoretical research questions and technical challenges. These include the development of error correction codes for combinatorial DNA, the exploration of optimal sampling rates, and the advancement of DNA synthesis technologies that support combinatorial synthesis. Combining combinatorial principles with error-correcting strategies paves the way for efficient, error-resilient DNA-based storage solutions.
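To make the idea of a combinatorial alphabet concrete, the following is a minimal sketch, not the authors' implementation. It assumes a letter is any subset of exactly k shortmers out of N, that reads are error-free and drawn uniformly from the letter's members, and that a letter is inferred by taking the k most frequently observed shortmers. The function names and the values N = 16, k = 8 are illustrative assumptions and do not come from the abstract.

```python
import math
import random
from collections import Counter
from itertools import combinations


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Information carried by one letter when a letter is a k-subset of N shortmers."""
    return math.log2(math.comb(n_shortmers, k))


def build_alphabet(n_shortmers: int, k: int) -> list:
    """Enumerate every letter as a sorted tuple of shortmer indices."""
    return list(combinations(range(n_shortmers), k))


def simulate_reads(letter, n_reads: int, rng: random.Random) -> list:
    """Assumed noise-free model: each read shows one member shortmer, chosen uniformly."""
    return [rng.choice(letter) for _ in range(n_reads)]


def infer_letter(reads, k: int) -> tuple:
    """Reconstruct a letter as the k most frequently observed shortmers."""
    counts = Counter(reads)
    top = [shortmer for shortmer, _ in counts.most_common(k)]
    return tuple(sorted(top))


if __name__ == "__main__":
    # A single nucleotide is the special case N = 4, k = 1 (2 bits).
    print(bits_per_letter(4, 1))
    # Hypothetical binomial alphabet: 8 of 16 shortmers per letter.
    print(round(bits_per_letter(16, 8), 2))

    rng = random.Random(0)
    letter = build_alphabet(16, 8)[100]
    for coverage in (8, 20, 60):
        reads = simulate_reads(letter, coverage, rng)
        print(coverage, infer_letter(reads, 8) == letter)
```

Under this model a letter is recovered only once every one of its k member shortmers has been seen at least once, which is one way to see why sequencing coverage matters so much for reconstruction. The bits-per-letter figure is per shortmer position, so it is not directly comparable to the per-nucleotide density quoted in the abstract.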
