Abstract: DNA sequencing is an area of medical science that is gaining prominence. Genetic mutations responsible for disease have been detected using DNA sequencing. This research focuses on pattern identification methodologies for DNA-sequencing problems arising in various applications. Examples of such problems include alignment and assembly of short reads from next-generation sequencing (NGS), comparison of DNA sequences, and determining the frequency of a pattern in a sequence. Because DNA sequences are often subject to mutation, approximate matching is as well suited to many applications as exact matching. Consequently, recognizing pattern similarity becomes necessary. Furthermore, it can be used in virtually every application that calls for pattern matching, for example, spell-checking, spam filtering, and search engines. In the traditional approach, finding a similar pattern in a sequence of length l_s with a pattern of length l_p takes O(l_s ∗ l_p) time. This heavy processing is caused by repeatedly comparing every character of the sequence with the pattern. This research aims to reduce the time complexity of pattern matching by introducing an approach named "optimized pattern similarity identification" (OPSI). The methodology constructs a table, entitled "shift beyond for avoiding redundant comparison" (SBARC), to bypass characters in the text that have already been compared with the pattern. The table holds the character distance to be skipped during matching. OPSI finds the spots in the sequence where similar patterns occur, ignoring at most è mismatches. The experiments indicate a time complexity of O(l_s · è). The allowed number of mismatches is much smaller than the size of the pattern.
Aspects such as scalability, generalizability, and performance of the OPSI algorithm are discussed. Compared with the Hamming distance-based approximate pattern matching algorithm, the proposed algorithm is found to be 69% more efficient.
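
As a point of reference, the traditional Hamming distance-based baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration only and is not the OPSI algorithm or its SBARC table, whose details the abstract does not give; the function name `hamming_matches` and its parameters are hypothetical, and `max_mismatches` plays the role of è.

```python
def hamming_matches(sequence: str, pattern: str, max_mismatches: int) -> list[int]:
    """Return every start position where `pattern` aligns with `sequence`
    with at most `max_mismatches` substituted characters.

    Baseline sketch: each alignment is checked character by character,
    so the worst case is O(l_s * l_p) comparisons.
    """
    ls, lp = len(sequence), len(pattern)
    positions = []
    for start in range(ls - lp + 1):
        mismatches = 0
        for offset in range(lp):
            if sequence[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches: abandon this alignment
        else:
            # Inner loop finished without exceeding the mismatch budget.
            positions.append(start)
    return positions
```

For example, `hamming_matches("ACGTACGT", "ACGA", 1)` returns `[0, 4]`, because the windows `ACGT` at positions 0 and 4 each differ from `ACGA` in one character.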

Hybrid Indexes for Repetitive Datasets

A Compressed Self-Index for Genomic Databases

AliBI: An Alignment-Based Index for Genomic Datasets

Indexing All Life's Known Biological Sequences

Self-Index Based on LZ77

Computing Matching Statistics on Repetitive Texts

Text Indexing for Long Patterns using Locally Consistent Anchors

Lossless Indexing with Counting de Bruijn Graphs

Fast, Small, and Simple Document Listing on Repetitive Text Collections

Optimal-Time Text Indexing in BWT-runs Bounded Space

DRESS: dimensionality reduction for efficient sequence search

Indexing Finite Language Representation of Population Genotypes

How to Find Long Maximal Exact Matches and Ignore Short Ones

Lightweight Pattern Matching Method for DNA Sequencing in Internet of Medical Things

FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

Practical combinations of repetition-aware data structures

Engineering Relative Compression of Genomes

Mining DNA Sequence Patterns with Constraints Using Hybridization of Firefly and Group Search Optimization

VA-Store: A Virtual Approximate Store Approach to Supporting Repetitive Big Data in Genome Sequence Analyses

Molecular-level similarity search brings computing to DNA data storage

DUHI: Dynamically updated hash index clustering method for DNA storage