Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning

Japheth E. Gado, Gregg T. Beckham, Christina M. Payne
DOI: https://doi.org/10.1101/2020.05.06.081737
2020-05-08
Abstract: Accurate prediction of the optimal catalytic temperature (T_opt) of enzymes is vital in biotechnology, as enzymes with high T_opt values are desired for enhanced reaction rates. Recently, a machine-learning method (TOME) for predicting T_opt was developed. TOME was trained on a normally distributed dataset with a median T_opt of 37°C and less than five percent of T_opt values above 85°C, limiting the method's predictive capabilities for thermostable enzymes. Due to the distribution of the training data, the mean squared error on T_opt values greater than 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that tackle the data imbalance to significantly decrease the error on high T_opt values (>85°C) by 60% and increase the overall R² value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
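To make the idea of resampling an imbalanced regression target concrete, the following is a minimal Python sketch. It is not the TOMER implementation, and the abstract does not say which resampling strategies or base learners the authors use. Everything here is an assumption chosen for illustration: the inverse-frequency weighting over 10°C bins, the least-squares line used as a stand-in base learner, the single toy feature, the synthetic data, and all function names.

```python
import random
from collections import Counter


def bin_index(t_opt, bin_width=10.0):
    """Map a temperature in °C to an integer bin (hypothetical binning)."""
    return int(t_opt // bin_width)


def inverse_frequency_weights(targets, bin_width=10.0):
    """Weight each sample by 1 / (number of samples in its temperature bin)."""
    counts = Counter(bin_index(t, bin_width) for t in targets)
    return [1.0 / counts[bin_index(t, bin_width)] for t in targets]


def resample(features, targets, rng, bin_width=10.0):
    """Draw a same-size sample with replacement, favouring rare target bins."""
    weights = inverse_frequency_weights(targets, bin_width)
    idx = rng.choices(range(len(targets)), weights=weights, k=len(targets))
    return [features[i] for i in idx], [targets[i] for i in idx]


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (stand-in for a real base learner)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def fit_ensemble(xs, ys, n_models=10, seed=0):
    """Train each ensemble member on its own rebalanced resample."""
    rng = random.Random(seed)
    return [fit_line(*resample(xs, ys, rng)) for _ in range(n_models)]


def predict(ensemble, x):
    """Average the predictions of all ensemble members."""
    return sum(a * x + b for a, b in ensemble) / len(ensemble)


def mse_in_range(ensemble, xs, ys, low, high):
    """Mean squared error restricted to samples with low <= target <= high."""
    errors = [(predict(ensemble, x) - y) ** 2
              for x, y in zip(xs, ys) if low <= y <= high]
    return sum(errors) / len(errors) if errors else float("nan")


# Synthetic, imbalanced toy data (not from the paper): 95% of targets lie
# between 20 and 59 °C, and 5% lie between 86 and 100 °C.
targets = [20 + (i % 40) for i in range(950)] + [86 + (i % 15) for i in range(50)]
noise = random.Random(42)
features = [t + noise.gauss(0.0, 5.0) for t in targets]  # one noisy toy feature

ensemble = fit_ensemble(features, targets)
print("MSE for 30-50 °C:", mse_in_range(ensemble, features, targets, 30, 50))
print("MSE above 85 °C:", mse_in_range(ensemble, features, targets, 85.001, 1000))
```

Reporting the error separately for each temperature range, as the abstract does, shows whether the rebalancing helps the rare high-T_opt region, which a single overall score can hide.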