Abstract: The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and to screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes multiple rounds of selection and sequencing. It does not rely on computing variant enrichment ratios and can therefore be used even when the sequence samples are disjoint.
Advances in sequencing techniques have recently generated an explosion of protein sequence data. This represents an opportunity for scientists to develop theoretical and computational methods that can extract relevant biological information from these data samples. In this respect, machine learning methods are proving to be particularly effective in the biological context. Since the majority of the accessible protein sequences are not annotated, i.e., nothing is known about their functional properties, unsupervised machine learning methods are particularly suited to tackle such raw sequence data.
Here, we propose an unsupervised inference method designed for protein sequence data generated by an evolutionary process, whether it takes place in a controlled experimental framework or in vivo. The method is devised to be simple enough to apply to a wide range of experimental setups while still modeling the fundamental features of the dynamical processes underlying data generation. The ultimate goal of the method is to provide a sequence-fitness mapping that extends beyond the experimentally assessed sequence space, so as to assign a quantitative functional score to each possible protein variant. Accurate knowledge of this mapping is key for several biological applications, such as biomolecule design and engineering, diagnostic and therapeutic treatments, and vaccine development.
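To make concrete why a learned sequence-fitness mapping can score variants that were never sequenced, and why it does not need the same variant to appear in two rounds, the following toy sketch may help. It is not AnnealDCA: it swaps in a much simpler site-independent model that compares per-position residue frequencies before and after one selection round. The function names, the three-letter alphabet, the pseudocount, and the toy sequences are all assumptions made for illustration only.

```python
import math
from collections import Counter


def site_frequencies(seqs, alphabet, pseudocount=1.0):
    """Per-position residue frequencies with additive smoothing."""
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(alphabet)
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return freqs


def fit_site_scores(round_before, round_after, alphabet, pseudocount=1.0):
    """Log-ratio of per-site frequencies after vs. before selection.

    Hypothetical stand-in for a fitness model: it uses site statistics,
    not per-variant enrichment ratios, so the two samples may share no
    full-length sequence at all.
    """
    f0 = site_frequencies(round_before, alphabet, pseudocount)
    f1 = site_frequencies(round_after, alphabet, pseudocount)
    return [
        {a: math.log(f1[i][a] / f0[i][a]) for a in alphabet}
        for i in range(len(f0))
    ]


def score(seq, site_scores):
    """Additive fitness proxy: sum of per-site log-ratios."""
    return sum(site_scores[i][a] for i, a in enumerate(seq))


# Toy data (invented): the two samples are disjoint, so no per-variant
# enrichment ratio could be computed for any sequence.
alphabet = "ACD"
round_before = ["AAC", "CDA", "DCC", "ADD"]
round_after = ["ACA", "AAD", "ADA"]

site_scores = fit_site_scores(round_before, round_after, alphabet)

# "AAA" was never observed in either round, yet it still receives a score.
unseen_score = score("AAA", site_scores)
depleted_score = score("DCC", site_scores)
print(unseen_score > depleted_score)
```

A site-independent model ignores interactions between positions, so it is only a rough proxy; the point of the sketch is the workflow (fit on sequenced rounds, then score arbitrary variants), not the model class.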
