Survey of Information Encoding Techniques for DNA

Thomas Heinis, Roman Sokolovskii, Jamie J. Alnasir
DOI: https://doi.org/10.1145/3626233
2023-10-03
Abstract: The yearly global production of data is growing exponentially, outpacing the capacity of existing storage media, such as tape and disk, and surpassing our ability to store it. DNA storage - the representation of arbitrary information as sequences of nucleotides - offers a promising storage medium. DNA is nature's information-storage molecule of choice and has a number of key properties: it is extremely dense, offering the theoretical possibility of storing 455 EB/g; it is durable, with a half-life of approximately 520 years that can be increased to thousands of years when DNA is chilled and stored dry; and it is amenable to automated synthesis and sequencing. Furthermore, biochemical processes that act on DNA potentially enable highly parallel data manipulation. Whilst biological information is encoded in DNA via a specific mapping from triplet sequences of nucleotides to amino acids, DNA storage is not limited to a single encoding scheme, and there are many possible ways to map data to chemical sequences of nucleotides for synthesis, storage, retrieval and data manipulation. However, there are several biological, error-tolerance and information-retrieval considerations that an encoding scheme needs to address to be viable. This comprehensive review focuses on comparing existing work done in encoding arbitrary data within DNA in terms of their encoding schemes, methods to address biological constraints and measures to provide error correction. We compare encoding approaches on the overall information density and coverage they achieve, as well as the data-retrieval method they use (i.e., sequential or random access). We also discuss the background and evolution of the encoding schemes.
Quantitative Methods, Databases, Data Structures and Algorithms, Information Theory
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores the use of DNA as a medium for data storage and compares existing information encoding techniques. Specifically:

1. **Background and Challenges**:
   - Global data production is growing exponentially, outpacing the capacity of existing storage media such as tape and disk.
   - Existing storage technologies may leave data unrecoverable in less than a century because of outdated hardware and software as well as physical degradation.
2. **Advantages of DNA Storage**:
   - Extremely high density: in theory, DNA can store 455 EB per gram.
   - Longevity: DNA has a half-life of approximately 520 years, which can be extended to thousands of years when it is chilled and stored dry.
   - Automation: DNA is amenable to automated synthesis and sequencing, which makes DNA storage more feasible.
3. **Comparison of Encoding Schemes**:
   - The paper compares existing work in terms of the encoding schemes used, the methods for addressing biological constraints, and the measures for providing error correction.
   - Comparison criteria include information density, required coverage, error detection and correction mechanisms, handling of biological constraints, access mechanisms (sequential or random access), and the types and sizes of stored data.
4. **Specific Case Analysis**:
   - The paper analyses several encoding methods in detail, including Microvenus, the Genesis project, and long-term DNA storage methods.
   - Each method has its own advantages and disadvantages. For example, Microvenus lacks error detection and correction, while the Genesis project has issues with homopolymer runs and GC-content imbalance.

Through these comparisons, the paper aims to provide a theoretical foundation and technical guidance for future DNA storage technologies, addressing the challenges faced by current data storage solutions.
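
To make the biological constraints mentioned above more concrete, the following is a minimal Python sketch of a naive two-bits-per-nucleotide mapping, together with checks for the two properties the summary names: homopolymer runs and GC content. The mapping table and the function names (`encode`, `decode`, `gc_content`, `longest_homopolymer`) are illustrative assumptions and are not taken from the paper or from any of the schemes it surveys.

```python
from itertools import groupby

# Hypothetical mapping chosen for illustration only: 2 bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Map each pair of bits to one nucleotide (4 nucleotides per byte)."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(seq: str) -> bytes:
    """Invert encode(); assumes an error-free sequence of whole bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


if __name__ == "__main__":
    strand = encode(b"\x00\x1b")
    print(strand, gc_content(strand), longest_homopolymer(strand))
```

A direct mapping like this offers no control over the output: a zero byte becomes `AAAA`, a homopolymer run with no G or C at all. It also provides no error detection or correction, so `decode` only works on an undamaged strand.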