FECDO-Flexible and Efficient Coding for DNA Odyssey

Fajia Sun, Long Qian
DOI: https://doi.org/10.1101/2024.02.18.580107
2024-03-10
Abstract: DNA has been pursued as a compelling medium for digital data storage during the past decade. While large-scale data storage and random access have been achieved in artificial DNA, the synthesis cost keeps hindering DNA data storage from popularizing into daily life. In this study, we proposed a more efficient paradigm for digital data compressing to DNA, while excluding arbitrary sequence constraints. Both standalone neural networks and pre-trained language models were used to extract the intrinsic patterns of data, and generated probabilistic portrayal, which was then transformed into constraint-free nucleotide sequences with a hierarchical finite state machine. Utilizing these methods, a 12%-26% improvement of compression ratio was realized for various data, which directly translated to up to 26% reduction in DNA synthesis cost. Combined with the progress in DNA synthesis, our methods are expected to facilitate the realization of practical DNA data storage.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses how to compress digital data more efficiently for DNA data storage in order to reduce the cost of DNA synthesis. Existing DNA data storage methods have already achieved large-scale storage and random access, but high synthesis cost remains the main obstacle to adoption in daily life. Synthesis cost is roughly proportional to the length of the nucleotide sequence that must be synthesized, so shortening the sequence needed to store a given piece of data lowers the cost of storing it.

To achieve this, the authors propose FECDO (Flexible and Efficient Coding for DNA Odyssey), which compresses data through two main modules:

1. **Pattern Extraction Module**: Uses neural networks (both standalone neural networks and pre-trained language models) to extract the intrinsic patterns of the data and generate a probabilistic representation.
2. **Sequence Transformation Module**: Uses a hierarchical finite state machine to convert the probabilistic representation into constraint-free nucleotide sequences, that is, sequences produced without imposing arbitrary sequence constraints.

With these methods, FECDO achieves a 12%-26% improvement in compression ratio on various data, which translates directly to up to a 26% reduction in DNA synthesis cost. Combined with advances in DNA synthesis technology, FECDO is expected to help make DNA data storage practical.
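
To make the link between a probabilistic model and sequence length concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple adaptive byte-frequency model as a stand-in for the paper's neural networks, assumes an ideal entropy coder (cost of `-log2(p)` bits per symbol), and assumes a constraint-free mapping of 2 bits per nucleotide. The names `ideal_code_length_bits`, `nucleotides_needed`, and `BITS_PER_NT` are hypothetical and used only for illustration.

```python
import math
from collections import Counter

# Assumption: with no sequence constraints, each nucleotide (A, C, G, T)
# can carry log2(4) = 2 bits.
BITS_PER_NT = 2.0


def ideal_code_length_bits(data: bytes) -> float:
    """Ideal code length of `data` under an adaptive order-0 byte model.

    Each byte is predicted from the counts of the bytes seen so far
    (add-one smoothing over 256 possible values) and costs -log2(p) bits,
    which is what an ideal entropy coder would spend. This toy model only
    stands in for a learned predictor; it is not the paper's model.
    """
    counts = Counter()
    total_bits = 0.0
    for seen, byte in enumerate(data):
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits


def nucleotides_needed(bits: float) -> int:
    """Number of nucleotides required to carry `bits` bits."""
    return math.ceil(bits / BITS_PER_NT)


sample = b"abracadabra" * 50
raw_nt = nucleotides_needed(len(sample) * 8)            # no compression
modeled_nt = nucleotides_needed(ideal_code_length_bits(sample))
saving = 1 - modeled_nt / raw_nt
print(f"raw: {raw_nt} nt, modeled: {modeled_nt} nt, saving: {saving:.1%}")
```

Under these assumptions, a model that assigns higher probability to the data that actually occurs yields fewer bits, and therefore fewer nucleotides to synthesize. Since cost is taken to be proportional to sequence length, the fractional saving in nucleotides is also the fractional saving in synthesis cost.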