Pool PaRTI: A PageRank-based Pooling Method for Robust Protein Sequence Representation in Deep Learning

Alp TARTICI, Gowri Nayar, Russ B Altman
DOI: https://doi.org/10.1101/2024.10.04.616701
2024-10-05
Abstract:
Motivation: Protein language models generate token-level embeddings for each residue, necessitating a method to pool these into a single vector representation of the entire protein. Traditional pooling methods often result in substantial information loss, impacting downstream task performance. We aim to develop a task-agnostic pooling method that preserves more information from token embeddings while offering biological interpretability.
Results: We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Pool PaRTI demonstrates statistically significant performance improvements across three diverse protein machine learning tasks, outperforming traditional pooling methods. It enhances accuracy, AUPRC, and MCC while offering interpretability by identifying biologically relevant regions without explicit structural training. Our approach is generalizable to encoder-containing protein language models.
Availability and Implementation: Pool PaRTI is implemented in Python with PyTorch and is available at https://github.com/Helix-Research-Lab/Pool_PaRTI.git
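The core idea in the Results (treat attention as a graph over tokens, rank the tokens with PageRank, then take a weighted sum of the token embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the attention has already been reduced to a single non-negative matrix in which entry [i][j] is how strongly token i attends to token j, and it uses a conventional damping factor of 0.85. The function names `pagerank_weights` and `weighted_pool` are hypothetical, and the way the paper combines heads and layers, orients edges, or normalizes weights may differ. Plain Python lists are used only to keep the sketch dependency-free.

```python
def pagerank_weights(attention, damping=0.85, tol=1e-10, max_iter=1000):
    """Return one importance weight per token via PageRank power iteration.

    attention[i][j] is assumed to be a non-negative score for how much
    token i attends to token j (an assumption of this sketch).
    """
    n = len(attention)
    # Row-normalize so every token distributes exactly one unit of importance.
    rows = []
    for row in attention:
        total = sum(row)
        rows.append([a / total for a in row] if total > 0 else [1.0 / n] * n)

    rank = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        new_rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:  # stop once the scores have stabilized
            break
    return rank


def weighted_pool(embeddings, weights):
    """Collapse per-token embeddings into one vector using the given weights."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


# Toy example: three tokens, all attending mostly to the middle one.
toy_attention = [
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

toy_weights = pagerank_weights(toy_attention)
toy_pooled = weighted_pool(toy_embeddings, toy_weights)
```

In this toy case the middle token receives the largest weight, so the pooled vector leans toward its embedding. With a perfectly uniform attention matrix the weights become equal and the result reduces to ordinary mean pooling.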
Bioinformatics