Data Augmentation Scheme for Raman Spectra with Highly Correlated Annotations

Christoph Lange, Isabel Thiele, Lara Santolin, Sebastian L. Riedel, Maxim Borisyak, Peter Neubauer, M. Nicolas Cruz Bournazou

2024-02-02

Abstract: In biotechnology, Raman Spectroscopy is rapidly gaining popularity as a process analytical technology (PAT) that measures cell densities as well as substrate and product concentrations. As it records vibrational modes of molecules, it provides that information non-invasively in a single spectrum. Typically, partial least squares (PLS) is the model of choice to infer information about variables of interest from the spectra. However, biological processes are known for their complexity, where convolutional neural networks (CNN) present a powerful alternative. They can handle non-Gaussian noise and account for beam misalignment, pixel malfunctions or the presence of additional substances. However, they require a lot of data during model training, and they pick up non-linear dependencies in the process variables. In this work, we exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels, so that a network trained on such data exhibits low correlations between the model predictions. We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training. This data augmentation technique enables us to reuse spectra as training data for new contexts that exhibit different correlations. The additional data allows for building a better and more robust model. This is of interest in scenarios where large amounts of historical data are available but are currently not used for model training. We demonstrate the capabilities of the proposed method using synthetic spectra of Ralstonia eutropha batch cultivations to monitor substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations during the experiments.

Machine Learning, Quantitative Methods

What problem does this paper attempt to address?

The paper addresses how to handle highly correlated annotations when Raman spectroscopy is used as a process analytical technology (PAT) to monitor cell density, substrate, and product concentrations in biotechnology. Specifically, it focuses on generating additional data points through data augmentation so that the correlation between model predictions is reduced, thereby improving the generalization ability and robustness of convolutional neural networks (CNNs) under different experimental conditions.

### Background and Problem

1. **Application of Raman Spectroscopy**:
   - Raman spectroscopy is becoming increasingly popular in biotechnology because it can non-invasively measure cell density, substrate, and product concentrations.
   - Partial least squares (PLS) is typically used to infer variables of interest from spectra, but CNNs are gradually being adopted due to their powerful predictive capabilities.
2. **Complexity of Biological Processes**:
   - Biological processes are highly complex, and the measurements are affected by non-Gaussian noise, beam misalignment, pixel failures, or the presence of additional substances.
   - CNNs can handle these complexities but require a large amount of training data and are prone to capturing nonlinear dependencies in process variables.
3. **Problem of Data Correlation**:
   - During cultivation, the measured variables are usually strongly correlated. For example, in batch cultivation, the substrate concentration is negatively correlated with biomass.
   - This correlation allows the model to perform well under similar conditions but causes the prediction quality to drop rapidly when the model is applied to different processes (e.g., fed-batch cultivation).

### Solution

The paper proposes a data augmentation scheme to "erase" the correlation in the training data, making the model applicable to a wider range of process conditions. The specific methods are as follows:

1. **Data Augmentation Algorithm**:
   - Generate new data points with statistically independent labels from a given dataset.
   - Use singular value decomposition (SVD) to solve the resulting equations and generate new spectral data.
   - Add artificial noise so that the noise characteristics of the generated samples match those of the original spectra.
2. **Data Synthesis**:
   - Use non-negative matrix factorization (NMF) to generate synthetic spectra from real spectra and offline measurement data.
   - Use mechanistic models to generate synthetic cultivation data with different monomer compositions.
3. **Evaluation Setup**:
   - Train and validate the model on synthetic datasets, testing its performance under different conditions.
   - Evaluate the model's generalization ability by comparing the mean squared error (MSE) on different datasets.

### Conclusion

The data augmentation method proposed in the paper effectively removes the correlation in the training data, improving the model's generalization ability and robustness under different experimental conditions. This allows old cultivation data to be used as training data for new experimental setups, reducing the number of cultivations required for new experiments. The method is not only applicable to Raman spectroscopy but can also be extended to other spectroscopic methods and different types of substances.
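The augmentation steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation; it only assumes that spectra are additive in the concentrations, so that a weighted sum of measured spectra corresponds to the same weighted sum of their labels. The function name `augment_spectra`, the uniform sampling of target labels within the observed range, and the noise top-up rule (which assumes independent Gaussian noise of known standard deviation `noise_std`) are all assumptions made for illustration.

```python
import numpy as np


def augment_spectra(spectra, labels, n_new, rng, noise_std=0.0):
    """Hypothetical helper: build spectra with independently sampled labels.

    spectra: (n, p) measured spectra, labels: (n, k) concentrations.
    Returns (n_new, p) generated spectra and their (n_new, k) labels.
    """
    spectra = np.asarray(spectra, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Assumption: draw every target label independently and uniformly
    # within the range observed in the original dataset.
    lo, hi = labels.min(axis=0), labels.max(axis=0)
    targets = rng.uniform(lo, hi, size=(n_new, labels.shape[1]))

    # Solve labels.T @ w = target for the minimum-norm weights w.
    # np.linalg.pinv computes the pseudoinverse via an SVD.
    pinv = np.linalg.pinv(labels.T)      # shape (n, k)
    weights = targets @ pinv.T           # shape (n_new, n)

    # Additivity: the same weights combine the measured spectra.
    new_spectra = weights @ spectra

    # Assumption: a weighted sum of spectra with i.i.d. noise of std
    # noise_std has noise std noise_std * ||w||. If ||w|| < 1, top the
    # noise up so the generated sample matches the original level.
    norm_sq = np.sum(weights**2, axis=1, keepdims=True)
    extra_std = noise_std * np.sqrt(np.clip(1.0 - norm_sq, 0.0, None))
    new_spectra = new_spectra + extra_std * rng.standard_normal(new_spectra.shape)
    return new_spectra, targets
```

The minimum-norm solution is one possible choice among the many weight vectors that reproduce a target label; it keeps the noise amplification of the linear combination as small as possible. A CNN would then be trained on the generated pairs, whose labels are uncorrelated by construction.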

Generative data augmentation and automated optimization of convolutional neural networks for process monitoring

Application of Semi-Supervised Convolutional Neural Network Regression Model Based on Data Augmentation and Process Spectral Labeling in Raman Predictive Modeling of Cell Culture Processes

Enhanced Data Augmentation using GANs for Raman Spectra Classification

Data Augmentation of Spectral Data for Convolutional Neural Network (CNN) Based Deep Chemometrics

Combining Convolutional Neural Networks and On-Line Raman Spectroscopy for Monitoring the Cornu Caprae Hircus Hydrolysis Process

Deep adversarial data augmentation for biomedical spectroscopy: Application to modelling Raman spectra of bone

Assessing the Performance of 1D-Convolution Neural Networks to Predict Concentration of Mixture Components from Raman Spectra

Multicomponent Raman Spectral Regression Using Complete and Incomplete Models and Convolutional Neural Networks

Data Augmentation of Raman Spectral and Its Application Research Based on DCGAN

Deep learning data augmentation for Raman spectroscopy cancer tissue classification

Automated Data Generation for Raman Spectroscopy Calibrations in Multi-Parallel Mini Bioreactors

Application of Spectral Small-Sample Data Combined with a Method of Spectral Data Augmentation Fusion (Sda-Fusion) in Cancer Diagnosis

Convolutional Neural Networks and Raman Spectroscopy for Semi-Supervised Dataset Construction and Transfer Learning Applications in Real-Time Quantitative Detection

Neighbouring pixel data augmentation: a simple way to fuse spectral and spatial information for hyperspectral imaging data analysis

Self-Calibrated Dual Contrasting for Annotation-Efficient Bacteria Raman Spectroscopy Clustering and Classification

NIR spectroscopy—CNN‐enabled chemometrics for multianalyte monitoring in microbial fermentation

Enhancing Decision Confidence in AI using Monte Carlo Dropout for Raman Spectra Classification

Near Infrared Spectral Analysis Based on Data Augmentation Strategy and Convolutional Neural Network

Conditional Generative Adversarial Network for Spectral Recovery to Accelerate Single-Cell Raman Spectroscopic Analysis

RamanNet: A generalized neural network architecture for Raman Spectrum Analysis