Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations

Christoph Lange, Isabel Thiele, Lara Santolin, Sebastian L. Riedel, Maxim Borisyak, Peter Neubauer, M. Nicolas Cruz Bournazou
2024-02-02
Abstract: In biotechnology, Raman spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) that measures cell densities as well as substrate and product concentrations. As it records vibrational modes of molecules, it provides that information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice to infer information about variables of interest from the spectra. However, biological processes are known for their complexity, where convolutional neural networks (CNN) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions or the presence of additional substances. However, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels, so that a network trained on such data exhibits low correlations between the model predictions. We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations during the experiments.
Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses how to handle highly correlated annotations when Raman spectroscopy is used as a process analytical technology (PAT) to monitor cell density, substrate, and product concentrations in biotechnology. Specifically, it focuses on generating additional data points through data augmentation so that the model predictions become less correlated, thereby improving the generalization ability and robustness of convolutional neural networks (CNNs) under different experimental conditions.

### Background and Problem

1. **Application of Raman Spectroscopy**:
   - Raman spectroscopy is becoming increasingly popular in biotechnology because it can non-invasively measure cell density, substrate, and product concentrations.
   - Partial least squares (PLS) is typically used to infer variables of interest from spectra, but CNNs are gradually being adopted due to their powerful predictive capabilities.
2. **Complexity of Biological Processes**:
   - Biological processes are highly complex, and their spectra are affected by non-Gaussian noise, beam misalignment, pixel failures, or the presence of additional substances.
   - CNNs can handle these complexities but require a large amount of training data and are prone to capturing nonlinear dependencies in the process variables.
3. **Problem of Data Correlation**:
   - During cultivation, the measured variables are usually strongly correlated. For example, in batch cultivation, the substrate is negatively correlated with biomass.
   - This correlation allows the model to perform well under similar conditions but causes the prediction quality to drop rapidly when the model is applied to different processes (e.g., fed-batch cultivation).

### Solution

The paper proposes a data augmentation scheme to "erase" the correlation in the training data, making the model applicable to a wider range of process conditions. The specific methods are as follows:

1. **Data Augmentation Algorithm**:
   - Generate new data points with statistically independent labels from a given dataset.
   - Use Singular Value Decomposition (SVD) to solve the equations and generate new spectral data.
   - Add artificial noise to match the noise characteristics of the generated samples.
2. **Data Synthesis**:
   - Use Non-negative Matrix Factorization (NMF) to generate synthetic spectra from real spectra and offline measurement data.
   - Use mechanistic models to generate synthetic cultivation data with different monomer compositions.
3. **Evaluation Setup**:
   - Train and validate the model using synthetic datasets, testing the model's performance under different conditions.
   - Evaluate the model's generalization ability by comparing the Mean Squared Error (MSE) on different datasets.

### Conclusion

The data augmentation method proposed in the paper effectively removes the correlation in the training data, improving the model's generalization ability and robustness under different experimental conditions. This allows old cultivation data to be reused as training data for new experimental setups, reducing the number of cultivations required for new experiments. The method is not only applicable to Raman spectroscopy but can also be extended to other spectroscopic methods and different types of substances.
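
To make the augmentation idea from the "Data Augmentation Algorithm" item more concrete, here is a minimal sketch, not the authors' implementation. It assumes that spectra are purely additive (linear in the labels, with no constant background), draws new labels independently and uniformly within the observed ranges, and finds mixing weights with NumPy's SVD-based pseudoinverse. The function name `augment_spectra`, the uniform sampling, the Gaussian noise model, and the toy batch-like data are all illustrative assumptions.

```python
import numpy as np


def augment_spectra(X, Y, n_new, noise_std=0.0, rng=None):
    """Hypothetical helper: mix measured spectra into new samples with independent labels.

    X: (n, p) spectra, Y: (n, k) labels. Assumes each spectrum is linear in its labels.
    """
    rng = np.random.default_rng(rng)
    # Assumption: sample each target label independently within its observed range.
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    Y_new = rng.uniform(lo, hi, size=(n_new, Y.shape[1]))
    # Minimum-norm weights W solving W @ Y = Y_new (np.linalg.pinv uses an SVD).
    W = Y_new @ np.linalg.pinv(Y)
    # Additivity: the same linear combination of spectra carries the combined labels.
    X_new = W @ X
    if noise_std > 0:
        # Assumption: simple Gaussian noise stands in for a realistic noise model.
        X_new = X_new + rng.normal(0.0, noise_std, size=X_new.shape)
    return X_new, Y_new


# Toy batch-like data: substrate falls while biomass and product rise over time.
t = np.linspace(0.0, 1.0, 50)
Y = np.column_stack([20.0 * (1.0 - t), 10.0 * t, 5.0 * t**2])

# Hypothetical pure-component spectra: one Gaussian peak per substance.
axis = np.linspace(0.0, 1.0, 200)
centers = np.array([0.2, 0.5, 0.8])
S = np.exp(-(((axis[None, :] - centers[:, None]) / 0.05) ** 2))
X = Y @ S  # noise-free additive spectra

X_new, Y_new = augment_spectra(X, Y, n_new=2000, rng=0)

print("label correlations before augmentation:\n", np.corrcoef(Y.T).round(2))
print("label correlations after augmentation:\n", np.corrcoef(Y_new.T).round(2))
```

In this toy setup the original substrate and biomass labels are perfectly anti-correlated, while the augmented labels are close to uncorrelated. The weights in `W` can be negative, and a constant spectral background would break the linearity assumption unless it is handled separately, so a real pipeline would need additional care.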