ProtBoost: protein function prediction with Py-Boost and Graph Neural Networks -- CAFA5 top2 solution

Alexander Chervov, Anton Vakhrushev, Sergei Fironov, Loredana Martignetti
2024-12-06
Abstract: Predicting protein properties, functions and localizations is an important task in bioinformatics. Recent progress in machine learning offers opportunities for improving existing methods. We developed a new approach called ProtBoost, which relies on the strength of pretrained protein language models, the new Py-Boost gradient boosting method and Graph Neural Networks (GCN). ProtBoost ranked second in the recent Critical Assessment of Functional Annotation (CAFA5) international challenge, which had more than 1600 participants. Py-Boost is the first gradient boosting method capable of predicting thousands of targets simultaneously, making it an ideal fit for tasks like the CAFA challenge. Our GCN-based approach stacks many individual models and boosts performance significantly. Notably, it can be applied to any task where targets are arranged in a hierarchical structure, such as Gene Ontology. Additionally, we introduce new methods for leveraging the graph structure of targets and present an analysis of protein language models for the protein function prediction task. ProtBoost is publicly available at: https://github.com/btbpanda/CAFA5-protein-function-prediction-2nd-place.
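To make the idea of hierarchically arranged targets concrete, here is a minimal sketch of one common way to use a target graph: raising each ancestor term's score to at least the highest score among its descendants, so that predictions respect the ontology. This is a generic illustration and not necessarily the procedure used in ProtBoost. The term names, the scores, and the helper functions `ancestors` and `propagate_scores` are all hypothetical.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of `term` in a DAG given child -> parents links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate_scores(
    scores: Dict[str, float], parents: Dict[str, List[str]]
) -> Dict[str, float]:
    """Return scores where each term scores at least as high as any descendant."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


# Toy ontology (hypothetical names): child -> list of direct parents.
toy_parents = {
    "child_a": ["mid"],
    "child_b": ["mid", "root"],
    "mid": ["root"],
}

# Hypothetical raw model scores for one protein; note child_a > mid > root,
# which is inconsistent with the hierarchy.
toy_scores = {"root": 0.2, "mid": 0.5, "child_a": 0.9, "child_b": 0.1}

consistent = propagate_scores(toy_scores, toy_parents)
print(consistent)
```

In this toy case the high score of `child_a` is pushed up to `mid` and `root`, while `child_b` keeps its own low score because descendants are never raised.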
Quantitative Methods