Abstract: Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.

What problem does this paper attempt to address?

The problem this paper addresses is **efficiently computing or approximating the Levenshtein distance**, especially in DNA storage and other biological applications. The Levenshtein distance is a widely used metric for evaluating sequence similarity, but computing it exactly is expensive, which is a challenge for large-scale data processing such as DNA storage. The paper proposes a neural-network-based sequence embedding technique trained with Poisson regression to address this.

### Background and Motivation of the Paper

The Levenshtein distance (also known as the edit distance) is defined as the minimum number of insertions, deletions, or substitutions required to transform one sequence into another. The dynamic programming algorithm computes it exactly, but its time complexity is \(O(mn)\), where \(m\) and \(n\) are the lengths of the two sequences. According to Theorem 1.1 of Backurs and Indyk (2015), for two sequences of length \(n\), the Levenshtein distance cannot be computed in \(O(n^{2 - \delta})\) time for any constant \(\delta > 0\) unless the strong exponential time hypothesis is false. An exact algorithm that is strongly subquadratic, and in particular one with linear complexity, is therefore considered unlikely.

With the rapid development of DNA storage technology, the Levenshtein distance is used in a growing range of applications, including sequence clustering, sequence alignment, and synchronous channel coding. As the amount of information stored in DNA molecules continues to increase, the computational cost of the Levenshtein distance has become an important challenge in these applications.

### Proposed Method

The paper proposes a neural-network-based Levenshtein distance embedding technique and introduces Poisson regression. Specifically:

1. **Theoretical Analysis**: The paper first analyzes how the embedding dimension affects model performance and provides a criterion for selecting an appropriate embedding dimension.
2. **Poisson Regression**: The Levenshtein distance between fixed-length sequences is assumed to follow a Poisson distribution, which aligns naturally with the definition of the Levenshtein distance. In addition, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps to remove skewness.
3. **Experimental Verification**: Comprehensive experiments on real DNA storage data show that the proposed method outperforms existing state-of-the-art approaches.

### Key Contributions

- **Theoretical Contribution**: An analysis of how the embedding dimension affects model performance, together with a criterion for selecting an appropriate embedding dimension.
- **Methodological Innovation**: A Levenshtein distance embedding technique based on Poisson regression, which both aligns naturally with the definition of the Levenshtein distance and helps to remove skewness.
- **Experimental Verification**: Comprehensive experiments on real DNA storage data that compare the proposed method with state-of-the-art approaches.

### Conclusion

Through theoretical analysis and experimental verification, the paper demonstrates the superior performance of the proposed Poisson-regression-based Levenshtein distance embedding technique in fields such as DNA storage. The method replaces the quadratic-time exact computation with a distance between embedding vectors and addresses the skewness present in existing methods, while keeping the approximation accurate.
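The two ingredients discussed above, the exact \(O(mn)\) dynamic programme and a Poisson regression objective on embedding distances, can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation: the function names are hypothetical, the use of the squared Euclidean distance between embedding vectors as the Poisson rate is an assumption made for the example, and the embedding vectors are hand-written placeholders rather than the output of a trained network.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance via the classic O(mn) dynamic programme."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def squared_euclidean(u, v) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Poisson negative log-likelihood of `target` under mean `rate`.

    The log(target!) term is dropped because it does not depend on `rate`.
    """
    return rate - target * math.log(rate + eps)


# Hypothetical usage: in a real system the vectors would come from a trained
# encoder; here they are hand-written placeholders (an assumption).
seq_a, seq_b = "ACGTAC", "ACTTGAC"
emb_a, emb_b = [0.2, 1.1, -0.3], [1.0, 0.1, 0.4]

true_distance = levenshtein(seq_a, seq_b)
predicted = squared_euclidean(emb_a, emb_b)
loss = poisson_nll(predicted, true_distance)
```

For a fixed target distance, `poisson_nll` is minimised when the predicted rate equals the target, which is why minimising it pushes the embedding distance toward the true Levenshtein distance. Deep learning frameworks typically offer an equivalent built-in loss, which would be the natural choice when training an actual encoder.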

Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Gene Prediction by the Noise-Assisted MEMD and Wavelet Transform for Identifying the Protein Coding Regions

DoDo-Code: a Deep Levenshtein Distance Embedding-based Code for IDS Channel and DNA Storage

NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search

Implicit Neural Multiple Description for DNA-based data storage

Bio-Constrained Codes with Neural Network for Density-Based DNA Data Storage

Deep Joint Source-Channel Coding for DNA Image Storage: A Novel Approach with Enhanced Error Resilience and Biological Constraint Optimization

Hidden Addressing Encoding for DNA Storage

Needleman-Wunsch Attention: A Framework for Enhancing DNA Sequence Embedding

Deep Hashing Based Model for Image Similarity Retrieval in DNA Storage

Content-Based Similarity Search in Large-Scale DNA Data Storage Systems

Beyond the Alphabet: Deep Signal Embedding for Enhanced DNA Clustering

A constrained Shannon-Fano entropy coder for image storage in synthetic DNA

Molecular-level similarity search brings computing to DNA data storage

Limit and screen sequences with high degree of secondary structures in DNA storage by deep learning method

Minimum Free Energy Coding for DNA Storage

Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning

High-density information storage and random access scheme using synthetic DNA

DNA Steganalysis Using Deep Recurrent Neural Networks

Rotating labeling of entropy coders for synthetic DNA data storage

Nucleosome positioning based on DNA sequence embedding and deep learning