Abstract:

Motivation: As of January 2022, the UniProtKB database contains 225 million sequences, and the NCBI non-redundant database contains 451 million protein sequences. This large volume of sequence data is ripe for analysis and can be extremely informative about biological function when analyzed with appropriate methods. Evolutionary information, such as the relationships among protein sequences, is key to performing sequence analyses. Since sequence matching is one of the primary ways that annotations are found, higher-quality sequence matches yield a larger number of identified homologs. Thus, there is an essential need for a faster and more accurate homolog detection method to process the rapidly growing volume of biological sequences.

Method: Recently, ever-improving artificial intelligence methods have brought major improvements in various predictive computational tasks such as structure prediction. One such approach uses language models to represent proteins numerically in a representation matrix (embeddings) while retaining context-dependent biochemical, biophysical, and evolutionary information. Transformer architectures that utilize attention neural networks can generate these context-aware numerical representations in an unsupervised fashion. One use for these protein embeddings is remote homolog detection. In this work, we utilize protein language models and then apply discrete cosine transforms to extract the essential part of these embeddings, resulting in a significantly smaller fixed-size matrix for each sequence. This allows us to numerically and efficiently calculate the distance between all pairs of proteins, resulting in homolog detection.

Results: Our Protein LAnguage model Search Tool (PLAST) is significantly faster, with runtimes linear in the number of sequences within the query database. With only one CPU core, it can scan a million sequences in less than a second. The method essentially removes the noise in the sequence data, which leads to significant improvements. PLAST is more accurate than other approaches in the benchmarks tested from the PFAM, SCOP, and CATH databases. When benchmarked with the PFAM database, the increase in the area under the receiver operating characteristic curve (AUROC) was 3.1% compared with NCBI-BLAST. The number of detectable remote homologs is now significantly larger, which pushes sequence matches deep into the usual twilight zone. Compared with state-of-the-art profile-based homology search tools like CSBLAST, the AUROC increase was still 2.0%. PLAST can find remote homologs for a significant number of proteins that had been thought to be unique due to homolog detection failure. The homologs found usually have less than 20% sequence identity, making them indistinguishable from noise with most other sequence matching methods.

Conclusion: PLAST is an accurate and fast homolog detection tool, essential for making easy and rapid use of the vast amount of data generated by next-generation sequencing methods. Quantization of sequence embeddings into highly compressed, noise-free representations with discrete cosine transforms allows for the efficient and accurate detection of both normal homologs and remote ones that are undetectable by other sequence similarity methods. The PLAST web server is accessible from https://mesihk.github.io/plast .
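The pipeline described in the abstract (embed a sequence, compress the embedding with a discrete cosine transform, compare fixed-size representations by distance) can be illustrated with a minimal sketch. This is not the PLAST implementation. The function names, the 4 x 4 block of retained coefficients, the unit-length normalization, and the Euclidean distance are all assumptions made for illustration, and `toy_embed` is a hypothetical, context-free stand-in for a real protein language model.

```python
import math

def toy_embed(seq, dim=8):
    """Hypothetical stand-in for a protein language model:
    one dim-sized vector per residue (no real context awareness)."""
    return [[math.sin(0.1 * ord(aa) * (j + 1)) for j in range(dim)] for aa in seq]

def dct_lowfreq(signal, keep):
    """First `keep` orthonormal DCT-II coefficients of a 1-D signal,
    zero-padded when the signal is shorter than `keep`."""
    n = len(signal)
    coeffs = []
    for k in range(min(keep, n)):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs + [0.0] * (keep - len(coeffs))

def fingerprint(embedding, keep_rows=4, keep_cols=4):
    """Compress an L x D embedding into a fixed-size vector of
    keep_rows * keep_cols low-frequency 2-D DCT coefficients."""
    n_feat = len(embedding[0])
    # DCT along the sequence axis, one feature column at a time.
    cols = [dct_lowfreq([row[j] for row in embedding], keep_rows)
            for j in range(n_feat)]
    # DCT along the feature axis for each retained sequence frequency.
    block = [dct_lowfreq([cols[j][r] for j in range(n_feat)], keep_cols)
             for r in range(keep_rows)]
    flat = [v for row in block for v in row]
    # Assumed normalization so that sequence length does not dominate distance.
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

def distance(a, b):
    """Euclidean distance between two fingerprints (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(database):
    """Precompute one fingerprint per named sequence."""
    return {name: fingerprint(toy_embed(seq)) for name, seq in database.items()}

def search(query, index):
    """Linear scan: one distance computation per indexed sequence."""
    q = fingerprint(toy_embed(query))
    return sorted((distance(q, fp), name) for name, fp in index.items())

if __name__ == "__main__":
    index = build_index({
        "protA": "MKTAYIAKQRQISFVKSHFSRQ",
        "protB": "GGGGGGGGGGGGPPPPPPPP",
    })
    for dist, name in search("MKTAYIAKQRQISFVKSHFSRQ", index):
        print(f"{name}\t{dist:.4f}")
```

Because every sequence maps to a vector of the same length regardless of how many residues it has, a query needs only one fixed-cost distance computation per database entry, which is consistent with the linear scan time described in the Results.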
