Abstract: Periodic changes in the concentration or activity of different molecules regulate vital cellular processes such as cell division and circadian rhythms. Developing mathematical models is essential to better understand the mechanisms underlying these oscillations. Recent data-driven methods like SINDy have fundamentally changed model identification, yet their application to experimental biological data remains limited. This study investigates SINDy’s constraints by directly applying it to biological oscillatory data. We identify insufficient resolution, noise, dimensionality, and limited prior knowledge as primary limitations. Using various generic oscillator models of different complexity and/or dimensionality, we systematically analyze these factors. We then propose a comprehensive guide for inferring models from biological data, addressing these challenges step by step. Our approach is validated using glycolytic oscillation data from yeast.
What problem does this paper attempt to address?
The paper asks how SINDy (Sparse Identification of Nonlinear Dynamics) can be used to identify a mathematical model of an oscillatory system from experimental biological data. SINDy performs well on synthetic data, but it runs into several limitations when applied to real biological measurements. The authors study these limitations and propose a systematic procedure for improving SINDy's performance on data from biological oscillators.
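For orientation, the regression problem behind SINDy can be written as follows. This is the standard textbook formulation rather than anything taken from the summary above; the symbols \(\Theta\), \(\Xi\), \(\xi_k\), and \(\lambda\) are introduced here for illustration, and \(X_t\) denotes the time derivative of the measured state \(X\).

\[
X_t = \Theta(X)\,\Xi, \qquad \Theta(X) = \begin{bmatrix} 1 & X & X^2 & \cdots \end{bmatrix}
\]

Here \(\Theta(X)\) is a library of candidate functions evaluated on the data and \(\Xi\) is a coefficient matrix that is assumed to be sparse. One common way to state the fit for the \(k\)-th state variable is

\[
\xi_k = \arg\min_{\xi}\; \left\lVert X_{t,k} - \Theta(X)\,\xi \right\rVert_2^2 + \lambda \left\lVert \xi \right\rVert_1 ,
\]

where \(\lambda\) controls how strongly sparsity is enforced. Each limitation listed below affects either the left-hand side (the derivative estimate) or the choice of columns in \(\Theta(X)\).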
### Specific problems include:
1. **Insufficient resolution**:
- Biological data, and time-series data in particular, usually have low temporal resolution. This makes it difficult for SINDy to capture the system's dynamics, especially for oscillators that alternate between fast and slow phases.
- **Formula representation**:
\[
\text{Resolution}=\frac{\text{Number of sampling points}}{\text{Time range}}
\]
Low resolution means few sampling points per unit time, so fast changes in the system's state are not resolved accurately.
2. **High noise**:
- Experimental data are usually noisy, which degrades SINDy's performance. SINDy fits the time derivative \(X_t\) rather than the original time series, and estimating that derivative numerically amplifies the measurement noise.
- **Formula representation**:
\[
X_t = \frac{dX}{dt}
\]
The presence of noise will lead to inaccurate estimation of \(X_t\), thus affecting model identification.
3. **Number of dimensions**:
- SINDy needs measurements of all relevant state variables to identify a correct model. In many biological systems, however, not all relevant variables can be measured, and the measured set may contain too few or too many variables.
- **Formula representation**:
\[
\text{State variables}=\{x_1,x_2,\ldots,x_n\}
\]
If some key variables are missing, the model will not be able to accurately describe the dynamic behavior of the system.
4. **Limited prior knowledge**:
- In many cases, prior knowledge of the system is very limited. For example, the specific interactions between certain molecules may be unknown. This lack of prior knowledge makes model identification more difficult.
- **Formula representation**:
\[
\text{Prior knowledge}=\{\text{Known interactions},\text{Physical laws}\}
\]
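The first two limitations can be illustrated with a small numerical experiment. The sketch below is not from the paper: it assumes a toy signal \(\sin(t)\), uses NumPy's finite differences as the derivative estimator, and the function name `derivative_error` is made up for this example. It compares the derivative error for coarse and fine sampling without noise, and then for fine sampling with a small amount of additive noise.

```python
import numpy as np

def derivative_error(n_points, noise_std, t_end=10.0, seed=0):
    """RMS error of a finite-difference derivative of sin(t) on n_points samples."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, t_end, n_points)
    # Toy "measurement": a sine wave plus Gaussian noise (assumption for illustration).
    x = np.sin(t) + noise_std * rng.standard_normal(n_points)
    dx_est = np.gradient(x, t)   # central differences in the interior
    dx_true = np.cos(t)          # exact derivative of the noise-free signal
    return float(np.sqrt(np.mean((dx_est - dx_true) ** 2)))

# Resolution effect: noise-free signal, coarse vs. fine sampling.
err_coarse = derivative_error(n_points=20, noise_std=0.0)
err_fine = derivative_error(n_points=2000, noise_std=0.0)

# Noise effect: same fine sampling, small additive measurement noise.
err_noisy = derivative_error(n_points=2000, noise_std=0.01)

print(f"coarse, clean: {err_coarse:.2e}")
print(f"fine, clean:   {err_fine:.2e}")
print(f"fine, noisy:   {err_noisy:.2e}")
```

In this toy setup the noise contribution to a finite-difference derivative grows roughly as `noise_std / dt`, so sampling more densely does not help by itself unless the data are also smoothed before differentiation.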
### Solutions:
To address these challenges, the authors propose a comprehensive guide that tackles them step by step, and they validate the approach on yeast glycolytic oscillation data. The steps include:
- **Improve data resolution**: Improve the sampling strategy or use a higher-frequency data acquisition device.
- **Noise reduction**: Apply noise filtering techniques, such as neural networks or other filters, to reduce the impact of noise on model identification.
- **Dimension reconstruction**: Use methods such as autoencoders or delay embedding to supplement missing state variables.
- **Combine prior knowledge**: Make as much use as possible of existing biological knowledge and physical laws to guide model construction.
With these steps, the authors aim to improve SINDy's performance on data from biological oscillatory systems and thereby gain a better understanding of their dynamics.
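To make the identification step itself concrete, the following is a minimal NumPy sketch of sequentially thresholded least squares, the sparse regression used in the original SINDy formulation. It is not the authors' pipeline. The test system (a damped linear oscillator with a known analytic solution), the polynomial library, the threshold value, and the function name `stlsq` are all assumptions chosen for illustration, and the data are dense and noise-free, which is exactly the regime where SINDy is known to work well. For real use, a maintained implementation such as the PySINDy package is a more sensible starting point.

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: sparse xi with theta @ xi ~ dxdt."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                      # prune small coefficients
        for k in range(dxdt.shape[1]):       # refit each equation on surviving terms
            big = ~small[:, k]
            if big.any():
                xi[big, k] = np.linalg.lstsq(theta[:, big], dxdt[:, k], rcond=None)[0]
    return xi

# Toy data (assumption): x' = -0.1 x + 2 y, y' = -2 x - 0.1 y, analytic solution.
t = np.linspace(0.0, 10.0, 2001)
x = np.exp(-0.1 * t) * np.cos(2.0 * t)
y = -np.exp(-0.1 * t) * np.sin(2.0 * t)
X = np.column_stack([x, y])

# Derivative estimate by finite differences (fine here because the data are clean).
dX = np.gradient(X, t, axis=0, edge_order=2)

# Candidate library: polynomials up to degree 2 in (x, y).
names = ["1", "x", "y", "x^2", "xy", "y^2"]
theta = np.column_stack([np.ones_like(t), x, y, x * x, x * y, y * y])

xi = stlsq(theta, dX)

# Print the nonzero terms of each identified equation.
for k, lhs in enumerate(["x'", "y'"]):
    terms = [f"{xi[j, k]:+.3f} {names[j]}" for j in range(len(names)) if xi[j, k] != 0.0]
    print(lhs, "=", " ".join(terms))
```

Adding measurement noise to `X` or reducing the number of samples in `t` is a quick way to see how sensitive the recovered coefficients are to the limitations discussed above.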