Training Compute-Optimal Protein Language Models

Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song

DOI: https://doi.org/10.1101/2024.06.06.597716

2024-06-09

Abstract: We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model (MLM) when repeating the commonly used Uniref database. To address this, we included metagenomic protein sequences in the training set to increase the diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observe a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within less or equivalent pre-training compute budgets.

Bioinformatics

What problem does this paper attempt to address?

This paper studies how to train protein language models compute-optimally, an important area of biological research that lacks best-practice guidance. The study is based on a dataset of 939 million protein sequences and trains over 300 models, ranging from 3.5 million to 10.7 billion parameters, on 5 to 200 billion unique tokens, to investigate the relationship between model size, training token count, and training objective. First, the paper reports diminishing returns for the Causal Language Model (CLM) and overfitting for the Masked Language Model (MLM) when the commonly used Uniref database is repeated during training. To address these problems, the authors add metagenomic protein sequences to the training set, which increases data diversity and avoids the plateau and overfitting effects. Second, the paper derives scaling laws for CLM and MLM on the Transformer architecture, tailored to the characteristics of protein sequence data. Third, it observes a transfer scaling phenomenon from CLM to MLM: a model pretrained with CLM can be transferred effectively to MLM, and the benefit is quantified by estimating the Effectively Transferred Tokens. Finally, the paper validates these scaling laws by comparing against the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, including protein generation and structure- and function-related tasks, all within equal or smaller pre-training compute budgets. In summary, the paper aims to improve the training efficiency and performance of protein language models by optimizing how compute is allocated between model size and training tokens, especially under a limited compute budget.
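To make the idea of a scaling law concrete, here is a minimal sketch of how a power-law relation between compute budget and best model size could be fitted. It is not the authors' code: the helper names `fit_power_law` and `approx_train_flops` are hypothetical, the 6 * N * D estimate of training FLOPs is a common rule of thumb for dense Transformers that is assumed here rather than taken from the abstract, and the data points are synthetic values chosen only to exercise the fit.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = k * x**alpha by least squares in log-log space; return (k, alpha)."""
    alpha, log_k = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_k)), float(alpha)


def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training cost C ~ 6 * N * D (an assumption, not from the abstract)."""
    return 6.0 * n_params * n_tokens


# Synthetic points generated from a known power law; NOT results from the paper.
compute = np.array([1e17, 1e18, 1e19, 1e20, 1e21])  # hypothetical FLOP budgets
true_k, true_alpha = 0.1, 0.5                        # made-up ground truth
best_params = true_k * compute ** true_alpha         # "best" model size per budget

k, alpha = fit_power_law(compute, best_params)
print(f"fitted exponent alpha = {alpha:.3f}, coefficient k = {k:.3g}")
```

In a real study the `(compute, best_params)` pairs would come from the lowest-loss model at each compute budget, and a separate fit would typically be made for each objective (CLM and MLM).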

Are Protein Language Models Compute Optimal?

Modeling Protein Using Large-scale Pretrain Language Model

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

Scaling Down for Efficiency: Medium-Sized Transformer Models for Protein Sequence Transfer Learning

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

Protein Language Models: Is Scaling Necessary?

Efficient Inference, Training, and Fine-tuning of Protein Language Models

ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

Long-context Protein Language Model

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

ProGen2: Exploring the Boundaries of Protein Language Models

Cramming Protein Language Model Training in 24 GPU Hours

ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models

Protein Language Model Fitness Is a Matter of Preference

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Training Compute-Optimal Large Language Models

Protein language models meet reduced amino acid alphabets