FLOP: Tasks for Fitness Landscapes Of Protein wildtypes

Peter Mørch Groth, Richard Michael, Jesper Salomon, Pengfei Tian, Wouter Boomsma

DOI: https://doi.org/10.1101/2023.06.21.545880

2024-03-01

Abstract: Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring proteins with the most desirable properties. Promising candidates from this initial discovery phase then form the basis of the second step: a more local optimization procedure, exploring the space of variants separated from this candidate by a number of mutations. While considerable progress has been made on evaluating machine learning methods on single protein datasets, benchmarks of data-driven approaches for global fitness landscape exploration are still lacking. In this paper, we have carefully curated a representative benchmark dataset, which reflects industrially relevant scenarios for the initial wildtype discovery phase of protein engineering. We focus on exploration within a protein family, and investigate the downstream predictive power of various protein representation paradigms, i.e., protein language model-based representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies, and how we can be misled into overly optimistic estimates of the state of the field. The codebase and data can be accessed via .

Bioinformatics

What problem does this paper attempt to address?

The paper addresses the first step of protein engineering: finding, among naturally occurring (wildtype) sequences, the proteins with the most desirable properties. Candidates from this discovery phase then serve as starting points for more local optimization. To support the evaluation of data-driven methods for exploring global fitness landscapes, the authors present a curated benchmark dataset called FLOP (Fitness Landscapes Of Protein wildtypes). The paper emphasizes the importance of experimental design when data are limited and proposes a sequence-identity-based stratified cross-validation strategy to avoid data leakage and overly optimistic performance estimates. The authors compare several types of protein representation (language model-based, structure-based, and evolution-based) and show how each affects downstream prediction performance. They use random forest regressors to handle the low-data regime and consider zero-shot predictors as alternatives for simple tasks. The paper also compares FLOP with existing benchmarks such as TAPE, PEER, ProteinGym, and FLIP, noting that these mainly target prediction of protein variant effects, whereas FLOP targets the fitness landscape of wildtype proteins, an area that has received little attention. The experimental results show that the choice of protein representation has a significant impact on prediction performance and that, on small datasets, simple models are not necessarily worse than complex ones. The paper provides three challenging tasks and analyzes the biases that experimental design can introduce, with the aim of promoting progress in protein engineering.
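To make the idea of an identity-based split concrete, here is a minimal Python sketch. It is not the authors' procedure, which this summary does not specify. It assumes the sequences are already aligned to equal length, uses single-linkage clustering at a hypothetical identity threshold, and places whole clusters into folds by size only, so the stratification on target values is left out. The names `identity`, `cluster_by_identity`, and `assign_folds` are illustrative.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical residues between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    # Ignore alignment columns where both sequences have a gap.
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def cluster_by_identity(seqs, threshold):
    """Single-linkage clusters: join any pair with identity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds):
    """Greedily place whole clusters, largest first, into the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Toy example (made-up sequences, assumed threshold of 0.8).
seqs = [
    "AAAAAAAAAA", "AAAAAAAAAC",
    "CCCCCCCCCC", "CCCCCCCCCD",
    "GGGGGGGGGG", "GGGGGGGGGT",
]
clusters = cluster_by_identity(seqs, threshold=0.8)
folds = assign_folds(clusters, n_folds=3)
```

Because clusters are never split, any two sequences in different folds have identity below the threshold, which is the property that guards against leakage between training and test partitions. A regressor such as a random forest could then be trained on all but one fold and evaluated on the held-out fold.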

ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics

Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains

ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction

Accelerating protein engineering with fitness landscape modeling and reinforcement learning

Low-N protein engineering with data-efficient deep learning

Improving Protein Optimization with Smoothed Fitness Landscapes

Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space

Heterogeneity of the GFP fitness landscape and data-driven protein design

Learning-Based Estimation of Fitness Landscape Ruggedness for Directed Evolution

Robust Model-Based Optimization for Challenging Fitness Landscapes

Persistent spectral theory-guided protein engineering

ProteinBench: A Holistic Evaluation of Protein Foundation Models

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

Proximal Exploration for Model-guided Protein Sequence Design

Results of the Protein Engineering Tournament: An Open Science Benchmark for Protein Modeling and Design

Protein Language Model Fitness Is a Matter of Preference

FLIGHTED: Inferring Fitness Landscapes from Noisy High-Throughput Experimental Data

Exploring protein fitness landscapes by directed evolution

PDBench: Evaluating Computational Methods for Protein Sequence Design