Abstract: With the world generating digital data at an exponential rate, DNA has emerged as a promising archival medium. It offers a more efficient and long-lasting digital storage solution due to its durability, physical density, and high information capacity. Research in the field includes the development of encoding schemes that are compatible with existing DNA synthesis and sequencing technologies. Recent studies suggest leveraging the inherent information redundancy of these technologies by using composite DNA alphabets. A major challenge in this approach is the noisy inference process, which has prevented the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter represents a subset of shortmers. The nature of these combinatorial alphabets minimizes mix-up errors, while also ensuring the robustness of the system. We formally define various combinatorial encoding schemes and investigate their theoretical properties, such as information density, reconstruction probabilities, and the required synthesis and sequencing multiplicities. We then suggest an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional error correction codes, and reconstruction algorithms. Using simulations, we demonstrate our suggested approach and evaluate different combinatorial alphabets for encoding 10 KB messages under different error regimes. The simulations reveal vital insights, including the relative manageability of nucleotide substitution errors over shortmer-level insertions and deletions.
Sequencing coverage was found to be a key factor affecting system performance, and the use of two-dimensional Reed-Solomon (RS) error correction significantly improved reconstruction rates. Our experimental proof-of-concept validates the feasibility of our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed successful reconstruction and established the robustness of our approach to different error types. Subsampling experiments supported the important role of the sampling rate and its effect on overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage, while raising theoretical research questions and technical challenges. These include the development of error correction codes for combinatorial DNA, the exploration of optimal sampling rates, and the advancement of DNA synthesis technologies that support combinatorial synthesis. Combining combinatorial principles with error-correcting strategies paves the way for efficient, error-resilient DNA-based storage solutions.
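
As a rough illustration of the quantities named in the abstract, the sketch below builds a combinatorial alphabet in which each letter is a subset of shortmers, computes the bits carried by one letter, and estimates how many reads are needed to observe every shortmer of a letter. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the choice of 16 shortmers with 5 per letter, and the idealized read model (error-free reads drawn uniformly from the shortmers of a letter, i.e. the classical coupon collector setting) are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb, log2


def build_alphabet(n_shortmers: int, k: int) -> list[tuple[int, ...]]:
    """Enumerate every k-subset of n shortmer indices; each subset is one letter."""
    return list(combinations(range(n_shortmers), k))


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Information carried by one combinatorial letter: log2 of the alphabet size."""
    return log2(comb(n_shortmers, k))


def expected_reads_to_see_all(k: int) -> float:
    """Coupon collector expectation: k * H_k uniform, error-free reads
    are needed on average before all k shortmers of a letter are seen.
    (Assumption: idealized sampling, no synthesis or sequencing errors.)"""
    return k * sum(1.0 / i for i in range(1, k + 1))


if __name__ == "__main__":
    n, k = 16, 5  # hypothetical parameters, for illustration only
    alphabet = build_alphabet(n, k)
    print("alphabet size:", len(alphabet))
    print("bits per letter:", round(bits_per_letter(n, k), 3))
    print("expected reads per position:", round(expected_reads_to_see_all(k), 3))
```

Larger subsets enlarge the alphabet but also raise the number of reads needed per position, which is one way to see why sequencing coverage matters so much for this kind of scheme.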

Sequencing coverage analysis for combinatorial DNA-based storage systems

Cover Your Bases: How to Minimize the Sequencing Coverage in DNA Storage Systems

Covering All Bases: The Next Inning in DNA Sequencing Efficiency

Efficient DNA-based data storage using shortmer combinatorial encoding

Sequence-Subset Distance and Coding for Error Control in DNA-based Data Storage

Error-Correcting Codes for Combinatorial Composite DNA

Coding Over Coupon Collector Channels for Combinatorial Motif-Based DNA Storage

A Combinatorial Perspective on Random Access Efficiency for DNA Storage

Concatenated Code Design for Constrained DNA Data Storage with Asymmetric Errors

Error-Correcting Codes for Nanopore Sequencing

Exact Error Exponents of Concatenated Codes for DNA Storage

Adaptive Coding for DNA Storage with High Storage Density and Low Coverage

An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads

Coding for Composite DNA to Correct Substitutions, Strand Losses, and Deletions

Coded Shotgun Sequencing

Coding over Sets for DNA Storage

Improved Coding over Sets for DNA-Based Data Storage

Modular non-repeating codes for DNA storage

DNA-Based Storage: Models and Fundamental Limits

Challenges for error-correction coding in DNA data storage: photolithographic synthesis and DNA decay

D2Sim: A Computational Simulator for Nanopore Sequencing based DNA Data Storage