Uncertainty Quantification Using Ensemble Learning and Monte Carlo Sampling for Performance Prediction and Monitoring in Cell Culture Processes

Thanh Tung Khuat, Robert Bassett, Ellen Otte, Bogdan Gabrys
2024-09-03
Abstract: Biopharmaceutical products, particularly monoclonal antibodies (mAbs), have gained prominence in the pharmaceutical market due to their high specificity and efficacy. As these products are projected to constitute a substantial portion of global pharmaceutical sales, the application of machine learning models in mAb development and manufacturing is gaining momentum. This paper addresses the critical need for uncertainty quantification in machine learning predictions, particularly in scenarios with limited training data. Leveraging ensemble learning and Monte Carlo simulations, our proposed method generates additional input samples to enhance the robustness of the model in small training datasets. We evaluate the efficacy of our approach through two case studies: predicting antibody concentrations in advance and real-time monitoring of glucose concentrations during bioreactor runs using Raman spectra data. Our findings demonstrate the effectiveness of the proposed method in estimating the uncertainty levels associated with process performance predictions and facilitating real-time decision-making in biopharmaceutical manufacturing. This contribution not only introduces a novel approach for uncertainty quantification but also provides insights into overcoming challenges posed by small training datasets in bioprocess development. The evaluation demonstrates the effectiveness of our method in addressing key challenges related to uncertainty estimation within upstream cell cultivation, illustrating its potential impact on enhancing process control and product quality in the dynamic field of biopharmaceuticals.
Quantitative Methods, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to quantify the uncertainty of machine learning predictions in cell culture processes, particularly when training data are limited. Specifically, it uses ensemble learning and Monte Carlo sampling to improve prediction performance and the real-time monitoring of process parameters in biopharmaceutical production, with a focus on the development and manufacturing of monoclonal antibodies (mAbs). The proposed method generates additional input samples to make models more robust on small training datasets, and its effectiveness is verified in two case studies:

1. **Predicting antibody concentration one day in advance**: The current offline measurements are used as input features to predict the antibody concentration one day ahead.
2. **Real-time monitoring of glucose concentration**: Raman spectroscopy data are used as input features to monitor the glucose concentration in real time during the bioreactor run.

### Main contributions

1. **A general framework**: Ensemble learning and Monte Carlo sampling are combined to estimate the uncertainty level of each predicted value, which is especially suited to small training datasets.
2. **Applied case studies**: The method is verified on two specific challenges (predicting antibody concentration in advance and real-time monitoring of glucose concentration).

### Method overview

The proposed method consists of the following steps:

1. **Generate synthetic training sets**:
   - Use Monte Carlo sampling to generate random values for each input feature and the target variable, based on the actual values and the coefficient of variation.
   - Generate \( N \) synthetic training sets; each one is used to train a base regressor.
2. **Construct an ensemble model**:
   - Train \( N \) base regressors, one per synthetic training set.
   - For a new test sample \( X_T \), calculate the mean \( \hat{y}(X_T) \) and the standard deviation \( \sigma(X_T) \) of the predictions of the \( N \) base regressors.
3. **Evaluate prediction uncertainty**:
   - Use the mean absolute error (MAE) as the performance indicator to compare the prediction performance of different models.
   - For models that return the standard deviation of the predictions, also calculate the MAE of the upper bound \( \hat{y} + 2\sigma \) and the lower bound \( \hat{y} - 2\sigma \).

### Experimental results

The experimental results show that the proposed ensemble SVR model outperforms the single SVR model in prediction performance, whereas the ensemble PLSR model performs comparably to the single PLSR model. The Gaussian process (GP) model achieves the best prediction performance, but its predicted uncertainty level is higher. In practical applications, prediction performance and uncertainty level therefore need to be considered together.

### Conclusion

The proposed framework provides an effective way to quantify the uncertainty of machine learning predictions and offers new ideas for handling small training datasets. This matters for improving process control and product quality in biopharmaceutical production.
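
The three steps of the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices here are assumptions made only for illustration: the Monte Carlo noise is drawn from a Gaussian with standard deviation equal to the coefficient of variation times the absolute measured value, a closed-form ridge regressor stands in for the SVR and PLSR base regressors to keep the example dependency-light, and the data, the function names (`perturb`, `fit_ensemble`, `predict_with_uncertainty`), and the default values (`n_models=50`, `cv_x=cv_y=0.05`) are hypothetical.

```python
import numpy as np


def perturb(values, cv, rng):
    """Draw one Monte Carlo sample around each measured value.

    Assumption: Gaussian noise with std = cv * |value| (cv = coefficient of variation).
    """
    values = np.asarray(values, dtype=float)
    return rng.normal(loc=values, scale=cv * np.abs(values))


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression; a stand-in for the SVR/PLSR base regressors."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalised
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_ridge(coef, X):
    """Predict with a coefficient vector returned by fit_ridge."""
    X = np.atleast_2d(X)
    return coef[0] + X @ coef[1:]


def fit_ensemble(X, y, n_models=50, cv_x=0.05, cv_y=0.05, seed=0):
    """Steps 1 and 2: one synthetic training set and one base regressor per member."""
    rng = np.random.default_rng(seed)
    return [
        fit_ridge(perturb(X, cv_x, rng), perturb(y, cv_y, rng))
        for _ in range(n_models)
    ]


def predict_with_uncertainty(models, X_test):
    """Mean and standard deviation of the base regressors' predictions."""
    preds = np.stack([predict_ridge(coef, X_test) for coef in models])
    return preds.mean(axis=0), preds.std(axis=0)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Hypothetical toy data: 12 samples, 3 features, a linear target.
data_rng = np.random.default_rng(1)
X = data_rng.uniform(1.0, 10.0, size=(12, 3))
y = X @ np.array([0.5, 1.0, -0.2]) + 2.0

models = fit_ensemble(X[:9], y[:9])
y_hat, sigma = predict_with_uncertainty(models, X[9:])

# Step 3: MAE of the mean prediction and of the +/- 2 sigma bounds.
mae_mean = mae(y[9:], y_hat)
mae_upper = mae(y[9:], y_hat + 2 * sigma)
mae_lower = mae(y[9:], y_hat - 2 * sigma)
print(mae_mean, mae_upper, mae_lower)
```

Swapping `fit_ridge` and `predict_ridge` for another regressor leaves the rest of the sketch unchanged, since the ensemble only needs each member to be fitted on its own perturbed copy of the training data.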