A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

Yiqing Shen, Zan Chen, Michail Mamalakis, Luhan He, Haiyang Xia, Tianbin Li, Yanzhou Su, Junjun He, Yu Guang Wang

2024-07-09

Abstract: The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for protein understanding effectively. To bridge this gap, we introduce ProteinLMDataset, a dataset specifically designed for further self-supervised pretraining and supervised fine-tuning (SFT) of LLMs to enhance their capability for protein sequence comprehension. Specifically, ProteinLMDataset includes 17.46 billion tokens for pretraining and 893,000 instructions for SFT. Additionally, we present ProteinLMBench, the first benchmark dataset consisting of 944 manually verified multiple-choice questions for assessing the protein understanding capabilities of LLMs. ProteinLMBench incorporates protein-related details and sequences in multiple languages, establishing a new standard for evaluating LLMs' abilities in protein comprehension. The large language model InternLM2-7B, pretrained and fine-tuned on the ProteinLMDataset, outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score.

Quantitative Methods, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper examines whether large language models (LLMs) can understand and process protein sequences. Although LLMs perform well in natural language processing, their potential for protein understanding is limited by the lack of a comprehensive dataset that directly links protein sequences to textual descriptions. To address this, the paper makes two key contributions:

1. ProteinLMDataset: a large hybrid dataset of protein sequences and text, designed specifically for LLMs, comprising 17.46 billion tokens for pre-training and 893,000 instructions for supervised fine-tuning (SFT). It aims to teach LLMs the correspondence between protein sequences and their textual descriptions, improving their ability to understand protein sequences.

2. ProteinLMBench: the first benchmark dataset for this purpose, consisting of 944 manually verified multiple-choice questions that evaluate LLMs' understanding of protein sequences. The questions cover protein-related details and sequences in multiple languages, setting a new standard for evaluating LLMs on protein understanding.

After pre-training and fine-tuning on ProteinLMDataset, the large language model InternLM2-7B outperforms GPT-4 on ProteinLMBench, achieving the highest accuracy score. The paper also discusses the limitations of existing datasets and how ProteinLMDataset and ProteinLMBench address them, advancing work at the intersection of protein science and deep learning.
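Accuracy on a multiple-choice benchmark such as ProteinLMBench is the fraction of questions for which the model selects the correct option. The following is a minimal Python sketch of that computation, not the authors' evaluation code. The `MCQItem` structure, the `accuracy` helper, and the placeholder items are all hypothetical and do not reflect the actual ProteinLMBench file format or contents.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: List[str]
    answer_index: int  # position of the correct option in `options`


def accuracy(items: Sequence[MCQItem], predictions: Sequence[int]) -> float:
    """Return the fraction of items whose predicted option index is correct."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


# Placeholder items for illustration only; these are not benchmark questions.
toy_items = [
    MCQItem("Placeholder question 1", ["A", "B", "C", "D"], answer_index=2),
    MCQItem("Placeholder question 2", ["A", "B", "C", "D"], answer_index=1),
    MCQItem("Placeholder question 3", ["A", "B", "C", "D"], answer_index=1),
]

# Hypothetical model choices: the first and third match the answer key.
toy_predictions = [2, 0, 1]
toy_accuracy = accuracy(toy_items, toy_predictions)
print(f"accuracy = {toy_accuracy:.3f}")
```

In practice, the predicted indices would come from parsing each model's response to the question prompt; how that parsing is done is left open here.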

Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

Modeling Protein Using Large-scale Pretrain Language Model

Fine-tuning protein language models boosts predictions across diverse tasks

Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

Training Compute-Optimal Protein Language Models

Evaluating large language models for annotating proteins

PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

Language modelling for biological sequences – curated datasets and baselines

Benchmarking Large Language Models for Molecule Prediction Tasks

A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations

ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks

Long-context Protein Language Model

ProteinBench: A Holistic Evaluation of Protein Foundation Models

PLM_Sol: predicting protein solubility by benchmarking multiple protein language models with the updated protein solubility dataset