Protein Representations: Encoding Biological Information for Machine Learning in Biocatalysis

David Harding-Larsen, Jonathan Funk, Niklas Gesmar Madsen, Hani Gharabli, Carlos G. Acevedo-Rocha, Stanislav Mazurenko, Ditte Hededam Welner

DOI: https://doi.org/10.26434/chemrxiv-2024-7hwf7

2024-04-03

Abstract: Enzymes offer a more environmentally friendly, lower-impact alternative to conventional chemistry, but they often require additional engineering for industrial settings, an endeavor that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that facilitate in silico study and engineering of novel enzymatic properties. However, the conversion from the biological domain to the computational realm requires special attention to ensure the training of accurate and precise models. In this review, we examine the critical step of encoding protein information into numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations (primary sequence, 3D structure, and dynamics) to explore their requirements for use and their inherent biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose a division into fixed representations, a collection of rule-based encoding strategies, and learned representations, extracted from the latent spaces of large neural networks. We also propose two main factors that govern the choice of the most suitable protein representation. The first is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second is the model objectives, which concern the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is intended to serve as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.

Chemistry

What problem does this paper attempt to address?

This paper reviews how biological information about proteins can be encoded into numerical representations for machine learning models, with the goal of supporting research in enzyme engineering and biocatalysis. Enzymes offer a more environmentally friendly, lower-impact alternative to conventional industrial chemistry, but they often need additional engineering before they can be used in industrial settings, which is time-consuming and challenging. To address this, the paper surveys methods for converting protein information (primary sequence, 3D structure, and dynamics) into numerical representations that machine learning algorithms can process.

The authors distinguish rule-based fixed representations (for example, one-hot encoding, physicochemical property encoding, and BLOSUM matrices) from learned representations extracted from the latent spaces of large neural networks. They also discuss combined representations of proteins and substrates in biocatalysis. Two main factors are proposed for selecting a suitable protein representation: the model setup (the size of the training dataset and the choice of architecture) and the model objectives (the assayed property, whether the model targets wild-type proteins or mutants, and explainability requirements).

The paper also covers applications of machine learning in enzyme engineering and biocatalysis, including prediction of activity and substrate specificity, estimation of metabolic enzyme activity, and prediction of protein stability and solubility. The authors emphasize that data representation can either enhance or limit what a model is able to learn, and they discuss inductive bias: the assumptions about expected patterns in the data that a model makes before training, which can be introduced through the choice of representation.
In conclusion, the paper aims to provide information and guidance for representing enzymes correctly in future machine learning models, in order to make enzyme engineering research more effective.
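
To make the idea of a rule-based fixed representation concrete, the following is a minimal sketch of two of the encodings named above: one-hot encoding and a physicochemical property encoding. It is an illustration under stated assumptions, not the authors' implementation. The function names `one_hot_encode` and `hydropathy_encode` are hypothetical, the residue ordering is an arbitrary choice, and the Kyte-Doolittle hydropathy scale is used only as one example of a physicochemical property; the summary does not say which properties the paper considers.

```python
# Illustrative sketch only: names and the choice of property scale are assumptions.

# The 20 canonical amino acids in a fixed (arbitrary but consistent) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values, used here as one example of a
# physicochemical property; other scales could be substituted.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _check(sequence):
    """Reject residues outside the 20 canonical amino acids."""
    for residue in sequence:
        if residue not in AA_INDEX:
            raise ValueError(f"Unsupported residue: {residue!r}")


def one_hot_encode(sequence):
    """Return an L x 20 matrix (list of lists) with a single 1 per row."""
    _check(sequence)
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        matrix.append(row)
    return matrix


def hydropathy_encode(sequence):
    """Return one hydropathy value per residue (a length-L vector)."""
    _check(sequence)
    return [KYTE_DOOLITTLE[residue] for residue in sequence]


if __name__ == "__main__":
    seq = "MKV"  # toy sequence, not taken from the paper
    print(one_hot_encode(seq))
    print(hydropathy_encode(seq))
```

The two encodings carry different inductive biases: one-hot encoding treats every pair of distinct residues as equally dissimilar, while a property-based encoding places chemically similar residues close together. A BLOSUM-based encoding would follow the same pattern, replacing each residue with its row of a substitution matrix.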

Machine Learning-Guided Protein Engineering

Data‐Driven Protein Engineering for Improving Catalytic Activity and Selectivity

Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Navigating the landscape of enzyme design: from molecular simulations to machine learning

Machine learning modeling of family wide enzyme-substrate specificity screens

Learning Strategies in Protein Directed Evolution

Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering

Learning the Language of Protein Structure

Machine learning for functional protein design

Accurate computational evolution of proteins and its dependence on deep learning

Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design

Physics-based Modeling in the New Era of Enzyme Engineering

A systematic review on the state-of-the-art strategies for protein representation

Machine Learning for Protein Engineering

Leveraging Structure for Enzyme Function Prediction: Methods, Opportunities, and Challenges

Semantical and Geometrical Protein Encoding Toward Enhanced Bioactivity and Thermostability

Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering

Computational scoring and experimental evaluation of enzymes generated by neural networks