The Protein Language Visualizer: Sequence Similarity Networks for the Era of Language Models

Javier Espinoza Herrera, María Fernanda Manríquez García, Sofía Medina Bermejo, Ailyn López Jasso, Karry Shi, Dyllan Mead, Sarah Marta Veskimägi, Maeve O'Connor, Adriana Siordia, Nathaniel Roethler, Adrian Jinich
DOI: https://doi.org/10.1101/2024.11.19.624229
2024-11-21
Abstract: The advent of high-throughput sequencing technologies and the availability of biological "big data" have accelerated the discovery of new protein sequences, making it challenging to keep pace with their functional annotation. To address this annotation challenge, techniques such as Sequence Similarity Networks (SSNs) have been employed to visually group proteins for faster identification. In this paper, we present an alternative visual analysis tool that uses Protein Language Model (PLM) embeddings. Our PLVis pipeline employs dimensionality reduction algorithms to cluster similar sequences, enabling rapid assessment of proteins based on their neighbors. Through analysis using average Jaccard distance and cosine similarity metrics, we found that well-separated clusters (those with silhouette scores above 0.95) captured high-dimensional information better than other regions of the projection. While proteins in poorly defined "fuzzy" regions showed similar embeddings to those in neighboring clusters, we note that distances in these projections should not be directly interpreted. To make this pipeline accessible to a wider research community, we have created a Google Colab Notebook for the comparison of protein datasets.
Biology
What problem does this paper attempt to address?
High-throughput sequencing has accelerated the discovery of new protein sequences faster than those proteins can be functionally annotated, and this paper addresses that annotation gap. It proposes PLVis, a visual analysis tool built on protein language model (PLM) embeddings, which clusters similar protein sequences so that a protein's likely function can be assessed quickly from its neighbors. PLVis uses dimensionality reduction to convert high-dimensional PLM embeddings into two-dimensional projections, letting researchers explore relationships between proteins more intuitively, especially in large protein datasets. The authors compare PLVis with the traditional BLAST-based sequence similarity network (SSN) method on a dataset of 10,000 randomly selected rSAM enzymes and show that PLVis retains information from the high-dimensional embedding space, particularly in well-separated clusters. They further illustrate its use through several case studies, including cross-species comparisons of protein families and of whole proteomes. Together, these studies support PLVis as an effective tool for exploring protein sequence-function space and as an efficient aid to functional annotation.
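One way to make "retaining information from the high-dimensional embedding space" concrete is to compare each protein's nearest neighbors before and after projection. The sketch below is an illustration only, not the authors' implementation: it assumes the Jaccard distance is taken between k-nearest-neighbor sets, with cosine distance in the embedding space and Euclidean distance in the 2D projection. The function names, the value of k, and the toy vectors are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_sets(points, k, dist):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def mean_neighbor_jaccard_distance(embeddings, projection, k):
    """Average Jaccard distance between each point's neighbor set in the
    embedding space (cosine) and in the 2D projection (Euclidean).
    0.0 means neighborhoods are fully preserved; 1.0 means none are.
    Assumption: this neighbor-set formulation is illustrative only."""
    high = knn_sets(embeddings, k, cosine_distance)
    low = knn_sets(projection, k, math.dist)
    total = 0.0
    for a, b in zip(high, low):
        total += 1.0 - len(a & b) / len(a | b)
    return total / len(high)


# Toy stand-ins for PLM embeddings: two tight pairs of vectors.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]

# A projection that keeps each pair together, and one that splits them.
good_projection = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
bad_projection = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]

good_score = mean_neighbor_jaccard_distance(embeddings, good_projection, k=1)
bad_score = mean_neighbor_jaccard_distance(embeddings, bad_projection, k=1)
print(good_score, bad_score)
```

A lower score indicates that local neighborhoods survive the projection. Computing it per cluster rather than over the whole dataset would be one way to contrast well-separated clusters with "fuzzy" regions, though the per-region procedure used in the paper is not specified here.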