An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum
2024-11-05
Abstract: The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
Machine Learning,Materials Science,Applied Physics,Computational Physics,Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
The problem this paper attempts to address is how to select the most informative training data to improve the predictive performance of mathematical models when data collection is costly and difficult. Specifically, the paper proposes an information matching method based on the Fisher Information Matrix (FIM) to select the minimal training dataset required to accurately constrain the parameters needed for the Quantities of Interest (QoI). This method not only improves data efficiency but also enhances the interpretability of the model.

### Main Issues

1. **Cost and difficulty of data collection**: High-quality data is crucial for training mathematical models, but collecting sufficient data is often expensive and challenging.
2. **Difference between parameter estimation and prediction accuracy**: In many applications, the ultimate goal of the model is to predict certain key quantities (QoI) rather than to accurately estimate all parameters. Therefore, traditional methods that optimize parameter accuracy may not always be the most effective.
3. **Model interpretability and numerical stability**: In some cases, combinations of parameters in the model are unidentifiable (i.e., "sloppy" parameters), leading to numerical instability in traditional methods.

### Solution

The paper proposes an information matching method based on the Fisher Information Matrix, implemented through the following steps:

1. **Define target accuracy**: First, determine the accuracy required for the target QoI.
2. **Select training data**: Use a convex optimization problem to select the minimal training dataset that ensures the data contains sufficient information to accurately constrain the parameters needed for the target QoI.
3. **Optimize parameters**: Through an iterative active learning process, gradually optimize the parameters to ensure the model can achieve the target accuracy with limited data.

### Application Examples

1. **Power system sensor placement**: Achieve accurate observation of the entire network state by selecting the optimal placement of PMUs (Phasor Measurement Units).
2. **Ocean acoustic source localization**: Achieve accurate estimation of the source location in shallow water by selecting the optimal positions for acoustic receivers.
3. **Development of interatomic potentials in materials science**: Develop accurate Stillinger-Weber potential functions for predicting material properties by selecting the optimal atomic configurations.

### Summary

The information matching method proposed in this paper not only improves data utilization efficiency but also enhances the interpretability and numerical stability of the model. This method demonstrates broad application prospects in various scientific fields, including power systems, ocean acoustics, and materials science.
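
As a rough illustration of the data-selection step in the Solution list, the sketch below checks an information-matching condition on a toy linear model. It makes two assumptions. First, it assumes Gaussian noise, so each FIM is the Jacobian product JᵀJ/σ². Second, it assumes the criterion can be written as "the weighted sum of candidate FIMs minus the QoI FIM is positive semidefinite, with non-negative weights". The function names, the three-parameter toy model, and the candidate rows are hypothetical and not taken from the paper. Finding the sparsest feasible weights would be handed to a convex (semidefinite) solver, which is omitted here.

```python
import numpy as np


def fisher_information(jacobian, sigma=1.0):
    """FIM under an assumed Gaussian noise model with std `sigma`: J^T J / sigma^2."""
    jacobian = np.atleast_2d(np.asarray(jacobian, dtype=float))
    return jacobian.T @ jacobian / sigma**2


def information_gap(weights, candidate_fims, target_fim):
    """Smallest eigenvalue of sum_m w_m I_m - J.

    A non-negative value means the weighted candidate data carry at least as
    much information as the target QoI FIM in every parameter direction
    (the assumed matching condition).
    """
    weights = np.asarray(weights, dtype=float)
    weighted_sum = np.tensordot(weights, candidate_fims, axes=1)
    return np.linalg.eigvalsh(weighted_sum - target_fim).min()


# Hypothetical toy model with three parameters (t0, t1, t2).
# Each candidate datum measures one linear combination with unit noise.
candidate_rows = [
    [1.0, 1.0, 0.0],   # candidate a: measures t0 + t1
    [0.0, 0.0, 1.0],   # candidate b: measures t2
    [1.0, -1.0, 0.0],  # candidate c: measures t0 - t1
]
candidate_fims = np.stack([fisher_information(row) for row in candidate_rows])

# The QoI depends only on t0 + t1 and must be known to within sigma = 0.5.
target_fim = fisher_information([1.0, 1.0, 0.0], sigma=0.5)

# Putting all the weight on candidate a matches the target information.
gap_a_only = information_gap([4.0, 0.0, 0.0], candidate_fims, target_fim)

# Candidates b and c never constrain t0 + t1, however heavily weighted.
gap_without_a = information_gap([0.0, 10.0, 10.0], candidate_fims, target_fim)

print("matched with candidate a only:", gap_a_only >= -1e-9)
print("matched without candidate a:", gap_without_a >= -1e-9)
```

In this toy case a single candidate is enough because the QoI depends on only one parameter combination, which mirrors the abstract's observation that a relatively small set of training data can supply the necessary information.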