How more data can hurt: Instability and regularization in next-generation reservoir computing

Yuanzhao Zhang, Sean P. Cornelius
2024-07-12
Abstract: It has been found recently that more data can, counter-intuitively, hurt the performance of deep neural networks. Here, we show that a more extreme version of the phenomenon occurs in data-driven models of dynamical systems. To elucidate the underlying mechanism, we focus on next-generation reservoir computing (NGRC), a popular framework for learning dynamics from data. We find that, despite learning a better representation of the flow map with more training data, NGRC can adopt an ill-conditioned "integrator" and lose stability. We link this data-induced instability to the auxiliary dimensions created by the delayed states in NGRC. Based on these findings, we propose simple strategies to mitigate the instability, either by increasing regularization strength in tandem with data size, or by carefully introducing noise during training. Our results highlight the importance of proper regularization in data-driven modeling of dynamical systems.
Machine Learning, Neural and Evolutionary Computing, Dynamical Systems, Adaptation and Self-Organizing Systems
What problem does this paper attempt to address?
The paper examines how additional training data can, counterintuitively, degrade next-generation reservoir computing (NGRC) models of dynamical systems, causing their long-term predictions to become unstable. NGRC is a popular framework for learning dynamics from data. The study finds that although more training data yields a better representation of the flow map, NGRC can adopt an ill-conditioned "integrator" and lose stability. This data-induced instability is linked to the auxiliary dimensions created by the delayed states in NGRC.

As a case study, the paper uses a magnetic pendulum system. As the number of training trajectories increases, the NGRC model at first captures the system's complex basins of attraction more accurately. Beyond a certain threshold, however, a model that was stable with less data becomes unstable, and all predicted trajectories diverge to infinity. Where this threshold lies depends on the regularization strength: stronger regularization delays the onset of instability, so more data calls for more aggressive regularization.

The paper rules out overfitting of the flow surfaces as the cause and instead explains the instability from a numerical-analysis perspective, treating the NGRC model as an integrator. As the volume of training data grows, the learned integrator becomes increasingly unstable, which is reflected in a growing condition number κ of the readout matrix. A complementary geometric explanation is that NGRC actually learns the flow map in a higher-dimensional space. When it fits more data on the lower-dimensional submanifold occupied by the real system, instability arises in the other (transverse) dimensions, so predictions diverge whenever the flow map is not fitted perfectly or the starting point does not lie on the submanifold.

The paper proposes two mitigation strategies: increasing the regularization strength in tandem with the data size, and carefully introducing noise during training. In summary, it shows that appropriate regularization is crucial for avoiding data-induced instability in data-driven modeling of dynamical systems.
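The link between training-set size, regularization strength, and the condition number κ of the readout matrix can be explored with a small Python sketch. This is not the authors' code, and everything specific in it is an assumption made for illustration: the toy damped pendulum (not the magnetic pendulum of the paper), the feature set (a constant plus linear and quadratic terms of the current state and one delayed state), the update form `x[t+1] = x[t] + W @ g`, the function names, and the linear scaling of λ with the number of samples, which is only one possible way to increase regularization "in tandem" with data size.

```python
import numpy as np


def ngrc_features(x_now, x_prev):
    """Constant, linear, and quadratic monomials of the current and delayed state."""
    lin = np.concatenate([x_now, x_prev])
    quad = np.outer(lin, lin)[np.triu_indices(lin.size)]  # unique pairwise products
    return np.concatenate([[1.0], lin, quad])


def simulate(x0, n_steps, dt=0.1):
    """Toy damped pendulum (an assumed stand-in system), integrated with RK4."""
    def f(x):
        return np.array([x[1], -np.sin(x[0]) - 0.2 * x[1]])

    traj = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = traj[-1]
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        traj.append(x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return np.array(traj)


def fit_readout(trajectories, lam):
    """Ridge regression for W in the assumed update x[t+1] = x[t] + W @ g(x[t], x[t-1])."""
    G, Y = [], []
    for traj in trajectories:
        for t in range(1, len(traj) - 1):
            G.append(ngrc_features(traj[t], traj[t - 1]))
            Y.append(traj[t + 1] - traj[t])
    G = np.array(G)  # shape (n_samples, n_features)
    Y = np.array(Y)  # shape (n_samples, state_dim)
    A = G.T @ G + lam * np.eye(G.shape[1])
    return np.linalg.solve(A, G.T @ Y).T  # shape (state_dim, n_features)


rng = np.random.default_rng(0)
initial_conditions = rng.uniform(-1.5, 1.5, size=(40, 2))
trajectories = [simulate(x0, n_steps=30) for x0 in initial_conditions]

base_lam = 1e-6  # hypothetical baseline regularization strength
for n_traj in (5, 10, 20, 40):
    subset = trajectories[:n_traj]
    n_samples = sum(len(traj) - 2 for traj in subset)
    W_fixed = fit_readout(subset, base_lam)               # lambda held constant
    W_scaled = fit_readout(subset, base_lam * n_samples)  # lambda grows with data
    print(n_traj, np.linalg.cond(W_fixed), np.linalg.cond(W_scaled))
```

Each printed row shows the number of training trajectories followed by κ of the readout matrix under fixed and data-scaled regularization. Whether this toy setup reproduces the trend reported in the paper is not guaranteed; the sketch only shows where the quantities discussed above appear in a ridge-regression fit.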