Recursive Cleaning for Large-scale Protein Data via Multimodal Learning

Zixuan Jiang, Sitao Zhang, Jiahang Cao, Qiang Zhang, Shiyi Liu, Yuetong Fang, Lingfeng Zhang, Rui Qing, Renjing Xu
DOI: https://doi.org/10.1101/2024.10.08.617190
2024-10-12
Abstract: Reliable datasets and high-performance models together drive significant advances in protein representation learning in the era of Artificial Intelligence. The size of protein models and datasets has grown exponentially in recent years. However, the quality of protein knowledge and model training has suffered from the lack of accurate and efficient data annotation and cleaning methods. To address this challenge, we introduce ProtAC, which corrects large Protein datasets with a scalable Automatic Cleaning framework that leverages both sequence and functional information through multimodal learning. To perform the cleaning, we propose the Sequence-Annotation Matching (SAM) module in the model, which filters for the functional annotations that are more suitable for the corresponding sequences. Our approach is a cyclic process consisting of three stages: first, pretraining the model on a large noisy dataset; then, finetuning the model on a small manually annotated dataset; and finally, cleaning the noisy dataset using the finetuned model. Through multiple rounds of "train-finetune-clean" cycles, we observe progressive improvement in protein function prediction and sequence-annotation matching. As a result, we achieve (1) a state-of-the-art (SOTA) model that outperforms competitors with fewer than 100M parameters, evaluated on multiple function-related downstream tasks, and (2) a cleaned UniRef50 dataset containing ~50M proteins with well-annotated functions. Through extensive biological analysis of the cleaned protein dataset, we demonstrate that our model is able to understand the relationships between different functional annotations in proteins and that the proposed functional annotation revisions are reasonable.
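The "train-finetune-clean" cycle described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `toy_pretrain`, `toy_finetune`, `clean_dataset`, and `recursive_cleaning`, the threshold of 0.5, and the scoring rule (matching on the first residue of a sequence) are all hypothetical stand-ins chosen only so that the loop runs end to end. In ProtAC itself, the matching score would come from the SAM module of a multimodal neural model.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Dataset = Dict[str, Set[str]]         # sequence -> set of annotation labels
Counts = Dict[Tuple[str, str], int]   # toy "model": (first residue, label) counts
Scorer = Callable[[str, str], float]  # (sequence, label) -> match score in [0, 1]


def toy_pretrain(model: Optional[Counts], data: Dataset) -> Counts:
    """Hypothetical stand-in for pretraining on the large noisy dataset."""
    counts: Counts = dict(model or {})
    for seq, labels in data.items():
        for label in labels:
            key = (seq[0], label)
            counts[key] = counts.get(key, 0) + 1
    return counts


def toy_finetune(model: Counts, curated: Dataset) -> Scorer:
    """Hypothetical stand-in for finetuning on the small curated dataset."""
    trusted = {(seq[0], label) for seq, labels in curated.items() for label in labels}

    def score(seq: str, label: str) -> float:
        key = (seq[0], label)
        if key in trusted:
            return 1.0                        # supported by curated data
        return 0.25 if key in model else 0.0  # seen only in noisy data, or never

    return score


def clean_dataset(data: Dataset, score: Scorer, threshold: float) -> Dataset:
    """Keep only annotations whose match score reaches the threshold."""
    return {
        seq: {label for label in labels if score(seq, label) >= threshold}
        for seq, labels in data.items()
    }


def recursive_cleaning(noisy: Dataset, curated: Dataset,
                       rounds: int = 3, threshold: float = 0.5) -> Dataset:
    """Repeat the pretrain -> finetune -> clean cycle for a fixed number of rounds."""
    data, model = noisy, None
    for _ in range(rounds):
        model = toy_pretrain(model, data)        # stage 1: train on current (noisy) data
        scorer = toy_finetune(model, curated)    # stage 2: finetune on curated data
        data = clean_dataset(data, scorer, threshold)  # stage 3: clean the dataset
    return data


# Toy data: "transport" on sequence "MKV" is treated as a noisy annotation.
noisy = {"MKV": {"kinase", "transport"}, "GAT": {"transport"}}
curated = {"MKL": {"kinase"}, "GGS": {"transport"}}
cleaned = recursive_cleaning(noisy, curated)
print(cleaned)
```

Each round feeds the cleaned output of the previous round back into pretraining, which is the recursive element of the approach. The sketch only removes annotations; whether and how the actual framework also adds or revises annotations is not specified here.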
Bioengineering