Sequencing coverage analysis for combinatorial DNA-based storage systems

Inbal Preuss, Ben Galili, Zohar Yakhini, Leon Anavy
DOI: https://doi.org/10.1101/2024.01.10.574966
2024-01-10
Abstract: This study introduces a novel model for analyzing and determining the required sequencing coverage in DNA-based data storage, focusing on combinatorial DNA encoding. We explore the application of the coupon collector model to post-sequencing combinatorial-letter reconstruction, which ensures efficient data retrieval and error reduction. We use a Markov chain model to compute the probability of error-free reconstruction. We develop theoretical bounds on the decoding probability and use empirical simulations to validate these bounds. The work contributes to the understanding of sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction. We provide a Python package that takes the code design and other message parameters as input and computes the read coverage required to guarantee reconstruction at a desired confidence level.
Synthetic Biology
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses sequencing coverage analysis in DNA data storage systems, with a particular focus on combinatorial DNA coding schemes. Specifically, it proposes a new model to calculate the required sequencing coverage in DNA data storage and explores the application of combinatorial-letter reconstruction to ensure efficient data retrieval and error reduction. The main contributions of the paper are as follows:

1. **Model Proposal**: A new model for analyzing sequencing coverage in combinatorial DNA coding is proposed, using combinatorial-letter reconstruction methods to ensure effective data recovery and error reduction.
2. **Theoretical Analysis**: The probability of error-free reconstruction is calculated using a Markov chain model, and theoretical bounds on the decoding probability are provided.
3. **Empirical Validation**: These theoretical bounds are validated through empirical simulations.
4. **Tool Development**: A Python package is provided that calculates the read coverage required to ensure data recovery at a given confidence level, based on the input coding design and other message parameters.

The paper focuses on understanding and optimizing sequencing coverage in DNA data storage, offering insights into decoding complexity, error correction, and sequence reconstruction.
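
The coupon collector model and the Markov chain analysis mentioned above can be illustrated with a minimal sketch. This is not the authors' package or their exact model. It assumes a single combinatorial letter made of `k` distinct components, that each read reveals one component chosen uniformly at random, and that reads are error-free. The function names `prob_all_seen` and `min_reads` are hypothetical. The chain's state is the number of distinct components observed so far, and the letter is reconstructed once the state reaches `k`.

```python
from fractions import Fraction


def prob_all_seen(k: int, n: int) -> Fraction:
    """Probability that n uniform draws over k items observe every item.

    Assumption (not from the paper): draws are independent, uniform,
    and error-free. State j = number of distinct items seen so far.
    """
    state = [Fraction(0)] * (k + 1)
    state[0] = Fraction(1)  # before any read, nothing has been seen
    for _ in range(n):
        nxt = [Fraction(0)] * (k + 1)
        for j, p in enumerate(state):
            if p == 0:
                continue
            # Read hits an item that was already seen: stay in state j.
            nxt[j] += p * Fraction(j, k)
            # Read hits a new item: move to state j + 1.
            if j < k:
                nxt[j + 1] += p * Fraction(k - j, k)
        state = nxt
    return state[k]


def min_reads(k: int, confidence: Fraction) -> int:
    """Smallest number of reads n with prob_all_seen(k, n) >= confidence.

    confidence must be strictly below 1 when k > 1, otherwise no finite
    n satisfies the condition.
    """
    n = k  # fewer than k reads can never cover k items
    while prob_all_seen(k, n) < confidence:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical example: a letter with 5 components, 99% confidence.
    print(min_reads(5, Fraction(99, 100)))
```

Exact `Fraction` arithmetic keeps the sketch easy to check by hand. For long sequences or large `k`, floating-point probabilities would be faster, and a realistic analysis would also need to account for many letters per sequence, non-uniform sampling, and sequencing errors.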