Protein Representations: Encoding Biological Information for Machine Learning in Biocatalysis

David Harding-Larsen, Jonathan Funk, Niklas Gesmar Madsen, Hani Gharabli, Carlos G. Acevedo-Rocha, Stanislav Mazurenko, Ditte Hededam Welner
DOI: https://doi.org/10.26434/chemrxiv-2024-7hwf7
2024-04-03
Abstract: Enzymes offer a more environmentally friendly and low-impact alternative to conventional chemistry, but they often require additional engineering for industrial settings, an endeavor that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that facilitate in silico study and engineering of novel enzymatic properties. However, the conversion from the biological domain to the computational realm requires special attention to ensure the training of accurate and precise models. In this review, we examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations (primary sequence, 3D structure, and dynamics) to explore their requirements for employment and inherent biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose a division into fixed representations, a collection of rule-based encoding strategies, and learned representations, extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors governing this choice. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives, concerning the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Chemistry
What problem does this paper attempt to address?
This paper reviews how to encode biological information into numerical representations that machine learning models can use, with the aim of supporting enzyme engineering and biocatalysis. Enzymes offer a more environmentally friendly, lower-impact alternative to conventional chemistry, but they often need further engineering before they suit industrial settings, which is time-consuming and challenging. To address this, the paper surveys methods for converting protein information (primary sequence, 3D structure, and dynamics) into numerical representations that machine learning algorithms can process. The authors divide encoding strategies into rule-based fixed representations (for example one-hot encoding, physicochemical property encoding, and BLOSUM matrices) and learned representations extracted from the latent spaces of large neural networks. They also discuss combined representations of proteins and substrates as emerging tools in biocatalysis. Two main factors are proposed for choosing a suitable protein representation: the model setup (influenced by the size of the training dataset and the choice of architecture) and the model objectives (the assayed property, the difference between wild-type models and mutant predictors, and explainability requirements).
The paper also introduces applications of machine learning in enzyme engineering and biocatalysis, including prediction of activity and substrate specificity, estimation of metabolic enzyme activity, and prediction of protein stability and solubility. The authors emphasize that the data representation can either enhance or limit what a model is able to learn. They discuss inductive bias, meaning the assumptions a model makes about expected patterns in the data before any learning takes place, and note that such assumptions can be introduced through the choice of data representation.
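To make the idea of a rule-based fixed representation concrete, the following is a minimal Python sketch of two such encodings for a primary sequence: one-hot encoding and a single physicochemical property per residue. It is an illustration only and is not taken from the paper. The function names, the alphabetical residue ordering, and the choice of the Kyte-Doolittle hydropathy scale as the example property are all assumptions made here.

```python
# Illustrative sketch only; names and design choices are assumptions,
# not code or conventions from the reviewed paper.

# The 20 canonical amino acids, ordered alphabetically by one-letter code
# (the ordering is arbitrary but must stay fixed across a dataset).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values, used here as one example of a
# physicochemical property; other scales could be substituted.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _validate(sequence):
    """Upper-case the sequence and reject non-canonical residues."""
    sequence = sequence.upper()
    for residue in sequence:
        if residue not in AA_INDEX:
            raise ValueError(f"Unsupported residue: {residue!r}")
    return sequence


def one_hot_encode(sequence):
    """Return a list of 20-element rows, one row per residue.

    Each row contains a single 1 at the index of that residue, so every
    amino acid is treated as equally distant from every other one.
    """
    encoded = []
    for residue in _validate(sequence):
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        encoded.append(row)
    return encoded


def hydropathy_encode(sequence):
    """Return one hydropathy value per residue (a 1-D property profile)."""
    return [KD_HYDROPATHY[residue] for residue in _validate(sequence)]


if __name__ == "__main__":
    peptide = "MKV"  # hypothetical three-residue example
    print(one_hot_encode(peptide))
    print(hydropathy_encode(peptide))
```

The two functions also illustrate the inductive bias carried by a representation: one-hot encoding assumes no similarity between residues, whereas a property-based encoding places chemically similar residues close together before the model has seen any data.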
In conclusion, this paper aims to provide information and guidance for the correct representation of enzymes in future machine learning models, in order to facilitate more effective enzyme engineering research.