Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Xiang Wei, Alan J.X. Guo, Sihan Sun, Mengyi Wei, Wei Yu
DOI: https://doi.org/10.1609/aaai.v38i14.29509
2023-12-13
Abstract: Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses the problem of **efficiently computing or approximating the Levenshtein distance**, especially in DNA storage and other biological applications. The Levenshtein distance is a widely used metric for evaluating sequence similarity, but its high computational cost is a challenge for large-scale data processing, particularly in DNA storage. The paper proposes a neural-network-based sequence embedding technique trained with Poisson regression to address this.

### Background and Motivation of the Paper

The Levenshtein distance (also known as the edit distance) is defined as the minimum number of insertions, deletions, or substitutions required to transform one sequence into another. The dynamic programming algorithm computes it exactly, but its time complexity is \(O(mn)\), where \(m\) and \(n\) are the lengths of the two sequences. According to Theorem 1.1 of Backurs and Indyk (2015), for two sequences of length \(n\), the Levenshtein distance cannot be computed in \(O(n^{2 - \delta})\) time for any constant \(\delta > 0\) unless the strong exponential time hypothesis is false. An exact linear-time algorithm is therefore considered infeasible.

With the rapid development of DNA storage technology, the Levenshtein distance is used in a growing range of applications, including sequence clustering, sequence alignment, and synchronous channel coding. As the amount of information stored in DNA molecules continues to increase, the cost of computing the Levenshtein distance has become a significant challenge in these applications.

### Proposed Method

The paper proposes a neural-network-based Levenshtein distance embedding technique and introduces Poisson regression. Specifically:

1. **Theoretical Analysis**: The paper first analyzes the influence of the embedding dimension on model performance and provides a criterion for selecting an appropriate embedding dimension.
2. **Poisson Regression**: The Levenshtein distance between fixed-length sequences is assumed to follow a Poisson distribution, which naturally aligns with the definition of the Levenshtein distance. In addition, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log-likelihood of the chi-squared distribution and helps to remove skewness.
3. **Experimental Verification**: Comprehensive experiments on real DNA storage data demonstrate the superior performance of the proposed method compared with existing state-of-the-art approaches.

### Key Contributions

- **Theoretical Contribution**: An analysis of the influence of the embedding dimension on model performance, together with a criterion for selecting an appropriate embedding dimension.
- **Methodological Innovation**: A Levenshtein distance embedding technique based on Poisson regression, which aligns naturally with the definition of the Levenshtein distance and helps to remove skewness in the embedding distances.
- **Experimental Verification**: Experiments on real DNA storage data that show the effectiveness of the proposed method relative to state-of-the-art approaches.

### Conclusion

Through theoretical analysis and experimental verification, the paper demonstrates the superior performance of the proposed Poisson-regression-based Levenshtein distance embedding technique on DNA storage data. The method replaces costly exact distance computation with a distance between embedding vectors and reduces the skewness observed in existing embedding methods while maintaining high accuracy.
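
To make the \(O(mn)\) cost mentioned in the background concrete, here is a minimal sketch of the classic dynamic programming algorithm for the Levenshtein distance. It is the textbook algorithm, not code from the paper, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic O(len(a) * len(b)) dynamic program."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

The nested loops visit every pair of positions once, which is where the quadratic cost comes from. For two DNA reads such as `levenshtein("ACGT", "AGT")`, the table has only 20 cells, but clustering many long reads pairwise multiplies this cost quickly.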
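
The Poisson regression idea described above can be illustrated with a small sketch of the loss for one sequence pair. This is an illustration under assumptions, not the authors' implementation: it assumes the predicted Poisson rate is the squared Euclidean distance between the two embedding vectors (the summary does not say which embedding distance is used), and the names `squared_euclidean` and `poisson_nll` are hypothetical.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def poisson_nll(d: int, rate: float, eps: float = 1e-8) -> float:
    """Negative log-likelihood of observing the count d under Poisson(rate).

    -log P(d | rate) = rate - d * log(rate) + log(d!)
    """
    rate = max(rate, eps)  # guard against log(0) for identical embeddings
    return rate - d * math.log(rate) + math.lgamma(d + 1)


# Assumption: the embedding distance plays the role of the Poisson rate,
# and the true Levenshtein distance d is the observed count.
u, v = [0.0, 1.0, 2.0], [1.0, 1.0, 3.0]
loss = poisson_nll(d=2, rate=squared_euclidean(u, v))
```

For a fixed observed distance `d`, this loss is minimized when the predicted rate equals `d`, so minimizing it during training pushes the embedding distance toward the true Levenshtein distance. The `log(d!)` term does not depend on the embeddings and can be dropped when the value is used only as a training objective.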