FLOP: Tasks for Fitness Landscapes Of Protein wildtypes

Peter Mørch Groth, Richard Michael, Jesper Salomon, Pengfei Tian, Wouter Boomsma
DOI: https://doi.org/10.1101/2023.06.21.545880
2024-03-01
Abstract: Protein engineering has the potential to create optimized protein variants with improved properties and function. An initial step in the protein optimization process typically consists of a search among natural (wildtype) sequences to find the naturally occurring proteins with the most desirable properties. Promising candidates from this initial discovery phase then form the basis of the second step: a more local optimization procedure, exploring the space of variants separated from this candidate by a number of mutations. While considerable progress has been made on evaluating machine learning methods on single protein datasets, benchmarks of data-driven approaches for global fitness landscape exploration are still lacking. In this paper, we have carefully curated a representative benchmark dataset, which reflects industrially relevant scenarios for the initial wildtype discovery phase of protein engineering. We focus on exploration within a protein family, and investigate the downstream predictive power of various protein representation paradigms, i.e., protein language model-based representations, structure-based representations, and evolution-based representations. Our benchmark highlights the importance of coherent split strategies, and how we can be misled into overly optimistic estimates of the state of the field. The codebase and data can be accessed via .
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the first step of protein engineering: searching among natural (wildtype) sequences for the proteins with the most desirable properties. Candidates found in this discovery phase then serve as starting points for a more local optimization over nearby variants. To support the evaluation of data-driven methods for this kind of global fitness landscape exploration, the researchers propose a representative benchmark dataset called FLOP (Fitness Landscapes Of Protein wildtypes). The paper emphasizes the importance of experimental design when data is limited and proposes a stratified cross-validation strategy based on sequence identity, so that closely related sequences do not end up on both sides of a train/test split. This avoids data leakage and the overly optimistic performance estimates that follow from it. The authors also compare different types of protein representation, namely protein language model-based, structure-based, and evolution-based representations, and examine how each affects downstream prediction performance. They use random forest regressors to handle the low-data setting and include zero-shot predictors as simple alternatives. In addition, the paper compares FLOP with existing benchmarks such as TAPE, PEER, ProteinGym, and FLIP, pointing out that these mainly focus on predicting the effects of protein variants, whereas FLOP focuses on characterizing the fitness landscape of wildtype proteins within a protein family, an area that has not received sufficient attention. The experimental results show that the choice of protein representation has a significant impact on prediction performance, and that on small datasets simple models are not necessarily worse than complex ones. The paper provides three challenging tasks and analyzes the biases that experimental design can introduce, with the aim of promoting progress in protein engineering.
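The idea behind an identity-aware split can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy sequences, the 0.75 threshold, the position-wise identity measure on pre-aligned sequences, and the helper names `identity`, `cluster_by_identity`, and `assign_folds` are all assumptions made for illustration. Sequences that are similar above the threshold are linked into one cluster, and whole clusters are then assigned to folds, so that no near-duplicate pair straddles a train/test boundary.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length (aligned) sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs, threshold):
    """Single-linkage clustering: link every pair with identity >= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def assign_folds(clusters, n_folds):
    """Place whole clusters into folds, largest first, always into the smallest fold."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Hypothetical toy data: three near-identical sequences, a related pair, one outlier.
seqs = ["MKTAYIAK", "MKTAYIAR", "MKTAYLAK", "GSHMLEDP", "GSHMLEDA", "QQWERTYV"]
threshold = 0.75
clusters = cluster_by_identity(seqs, threshold)
folds = assign_folds(clusters, n_folds=3)

for k, fold in enumerate(folds):
    print(f"fold {k}: {[seqs[i] for i in fold]}")
```

Each fold can then be held out in turn while a regressor is fit on the remaining folds. A random split over the same six sequences could place near-identical sequences in both train and test, which is the kind of leakage the summary above warns against. For real datasets a dedicated clustering tool and a proper alignment would replace the quadratic toy comparison used here.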