Protein domain embeddings for fast and accurate similarity search

Benjamin Giovanni Iovino, Haixu Tang, Yuzhen Ye
DOI: https://doi.org/10.1101/gr.279127.124
IF: 9.438
2024-10-12
Genome Research
Abstract: Recently developed protein language models have enabled a variety of applications through the contextual protein embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transform (DCT) to matrices of residue embeddings. Such protein-level embeddings have been used for fast searches for similar proteins, but they have limitations: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of the residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, performs domain segmentation on contact maps predicted by ESM-2, using a recursive cut algorithm (RecCut for short) that runs in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity both between proteins that share global similarity but have undefined extended regions between shared domains, and between proteins that share only local similarity. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
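The DCT step described in the abstract can be illustrated with a minimal sketch: a 2D DCT is applied to a domain's matrix of residue embeddings (one row per residue) and only a small block of low-frequency coefficients is kept, so domains of any length map to vectors of the same size. The function names, the unnormalized DCT-II, and the 3 x 4 block size are assumptions made for illustration; this is not the DCTdomain implementation, and it omits any scaling or quantization the authors may apply.

```python
import math


def dct2_1d(x):
    """Unnormalized DCT-II of a sequence of floats."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def dct_fingerprint(embeddings, n_rows=3, n_cols=4):
    """Compress an L x D residue-embedding matrix into n_rows * n_cols floats.

    n_rows and n_cols are hypothetical defaults, not values from the paper.
    """
    length, dim = len(embeddings), len(embeddings[0])
    if length < n_rows or dim < n_cols:
        raise ValueError("matrix is smaller than the requested coefficient block")
    # DCT along the residue axis, one embedding dimension at a time.
    per_dim = [
        dct2_1d([embeddings[i][j] for i in range(length)]) for j in range(dim)
    ]
    fingerprint = []
    # Keep only the n_rows lowest frequencies along the residue axis.
    for k in range(n_rows):
        # DCT along the embedding axis, then keep the n_cols lowest frequencies.
        row = dct2_1d([per_dim[j][k] for j in range(dim)])
        fingerprint.extend(row[:n_cols])
    return fingerprint


# Two toy "domains" of different lengths yield vectors of the same size.
short_domain = [[math.sin(i + j) for j in range(8)] for i in range(10)]
long_domain = [[math.sin(i + j) for j in range(8)] for i in range(25)]
fp_short = dct_fingerprint(short_domain)
fp_long = dct_fingerprint(long_domain)
```

Because every fingerprint has the same length regardless of domain size, two domains can be compared with an ordinary vector distance, which is what makes a fast search over many domains feasible.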
genetics & heredity, biochemistry & molecular biology, biotechnology & applied microbiology