Inference of annealed protein fitness landscapes with AnnealDCA
Luca Sesta, Andrea Pagnani, Jorge Fernandez-de-Cossio-Diaz, Guido Uguzzoni
DOI: https://doi.org/10.1371/journal.pcbi.1011812
2024-02-21
PLoS Computational Biology
Abstract: The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data from immunized mice and to screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing. It does not rely on the computation of variant enrichment ratios, and thus can be used even when the sequence samples are disjoint.
Advances in sequencing techniques have recently generated an explosion of protein sequence data. This represents an opportunity for scientists to develop theoretical and computational methods that can extract relevant biological information from these data samples. From this perspective, machine learning methods are proving to be particularly effective in the biological context. Since the majority of the accessible protein sequences are not annotated, i.e., no information about their functional properties is known, unsupervised machine learning methods are particularly suited to tackle such raw sequence data.
Here, we propose an unsupervised inference method meant to be applied to protein sequence data generated by an evolutionary process, whether it takes place in a controlled experimental framework or in vivo. The method is devised to be simple enough to be applied to a plethora of different experimental setups, while still modeling the fundamental features of the dynamical processes underlying data generation. The ultimate goal of the method is to provide a sequence-fitness mapping that goes beyond the experimentally assessed sequence space, so as to assign a quantitative functional score to each possible protein variant. Accurate knowledge of this mapping is key for several biological applications, such as biomolecule design and engineering, diagnostic and therapeutic treatments, and vaccine development.
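The abstract states that the method does not rely on enrichment ratios and so remains usable on disjoint samples. As a minimal illustration of why that matters, the sketch below computes a conventional log enrichment ratio between two sequencing rounds. It is not part of AnnealDCA; the function name `enrichment_ratios` and the toy sequences are hypothetical, chosen only to show that a ratio is defined solely for variants observed in both rounds.

```python
from collections import Counter
import math


def enrichment_ratios(round_a, round_b):
    """Log enrichment log(f_b / f_a) for variants observed in both rounds.

    round_a, round_b: lists of sequence strings (one entry per read).
    Variants missing from either round have no defined ratio and are skipped.
    """
    counts_a, counts_b = Counter(round_a), Counter(round_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()
    return {
        seq: math.log((counts_b[seq] / total_b) / (counts_a[seq] / total_a))
        for seq in shared
    }


# Toy data (hypothetical): two rounds that share some variants.
overlapping = enrichment_ratios(
    ["ACD", "ACD", "AKD", "AKE"],
    ["ACD", "ACD", "ACD", "AKD"],
)

# Toy data (hypothetical): two rounds with no variant in common.
disjoint = enrichment_ratios(["ACD", "AKD"], ["GHI", "GHK"])

print(sorted(overlapping))  # variants with a defined ratio
print(disjoint)             # empty: no ratio can be computed
```

With disjoint rounds the dictionary is empty, so a ratio-based score carries no information. This is the situation in which a model fitted to the sequence samples themselves, as proposed here, can still be applied.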
biochemical research methods, mathematical & computational biology