Abstract: The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
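
One way to read the criterion in the abstract is as the convex program below. This is a sketch under stated assumptions, not notation taken from the abstract: $M$ is the number of candidate data points, $\mathcal{I}_m(\theta)$ is the Fisher Information Matrix contributed by candidate $m$, $\mathcal{J}(\theta)$ is the Fisher Information Matrix implied by the target precision of the QoIs, and $w_m$ is a nonnegative weight attached to candidate $m$. The $\ell_1$-style objective is also an assumption, chosen because it favors a small set of selected data.

$$
\min_{w \in \mathbb{R}^{M}} \; \sum_{m=1}^{M} w_m
\quad \text{subject to} \quad
w_m \ge 0 \;\; (m = 1, \dots, M), \qquad
\sum_{m=1}^{M} w_m \, \mathcal{I}_m(\theta) \succeq \mathcal{J}(\theta)
$$

Here $A \succeq B$ means that $A - B$ is positive semidefinite. Candidates with $w_m > 0$ would be the ones kept for training. Both the objective and the constraint set are convex in $w$, which is consistent with the scalability claim in the abstract.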

What problem does this paper attempt to address?

The problem this paper attempts to address is how to select the most informative training data to improve the predictive performance of mathematical models when data collection is costly and difficult. Specifically, the paper proposes an information matching method based on the Fisher Information Matrix (FIM) to select the minimal training dataset required to accurately constrain the parameters needed for the Quantities of Interest (QoI). This method not only improves data efficiency but also enhances the interpretability of the model.

### Main Issues

1. **Cost and difficulty of data collection**: High-quality data is crucial for training mathematical models, but collecting sufficient data is often expensive and challenging.
2. **Difference between parameter estimation and prediction accuracy**: In many applications, the ultimate goal of the model is to predict certain key quantities (QoI) rather than to accurately estimate all parameters. Therefore, traditional methods that optimize parameter accuracy may not always be the most effective.
3. **Model interpretability and numerical stability**: In some cases, combinations of parameters in the model are unidentifiable (i.e., "sloppy" parameters), leading to numerical instability in traditional methods.

### Solution

The paper proposes an information matching method based on the Fisher Information Matrix, implemented through the following steps:

1. **Define target accuracy**: First, determine the accuracy required for the target QoI.
2. **Select training data**: Use a convex optimization problem to select the minimal training dataset that ensures the data contains sufficient information to accurately constrain the parameters needed for the target QoI.
3. **Optimize parameters**: Through an iterative active learning process, gradually optimize the parameters to ensure the model can achieve the target accuracy with limited data.

### Application Examples

1. **Power system sensor placement**: Achieve accurate observation of the entire network state by selecting the optimal placement of PMUs (Phasor Measurement Units).
2. **Ocean acoustic source localization**: Achieve accurate estimation of the source location in shallow water by selecting the optimal positions for acoustic receivers.
3. **Development of interatomic potentials in materials science**: Develop accurate Stillinger-Weber potential functions for predicting material properties by selecting the optimal atomic configurations.

### Summary

The information matching method proposed in this paper not only improves data utilization efficiency but also enhances the interpretability and numerical stability of the model. This method demonstrates broad application prospects in various scientific fields, including power systems, ocean acoustics, and materials science.
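
The data-selection step described above can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a two-parameter model, three hypothetical candidate measurements (`A`, `B`, `C`) with unit noise, and a single QoI. It also assumes the matching condition takes the form "weighted sum of candidate FIMs minus target FIM is positive semidefinite". A brute-force search over integer weights stands in for the convex solver so that the example needs only the standard library. All function and variable names are illustrative.

```python
from itertools import product


def outer(v):
    """Rank-one FIM v v^T of one scalar observation with unit noise."""
    return [[v[0] * v[0], v[0] * v[1]],
            [v[1] * v[0], v[1] * v[1]]]


def weighted_sum(weights, mats):
    """Return sum_m weights[m] * mats[m] for 2x2 matrices."""
    return [[sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(2)]
            for i in range(2)]


def is_psd_2x2(m, tol=1e-9):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] >= -tol and m[1][1] >= -tol and det >= -tol


def matches(weights, candidate_fims, target_fim):
    """Assumed matching condition: sum_m w_m I_m - J is positive semidefinite."""
    total = weighted_sum(weights, candidate_fims)
    diff = [[total[i][j] - target_fim[i][j] for j in range(2)] for i in range(2)]
    return is_psd_2x2(diff)


def select_weights(candidate_fims, target_fim, grid):
    """Brute-force stand-in for the convex solver: among all weight vectors
    on the grid that satisfy the matching condition, keep the one with the
    smallest total weight (first one found wins ties)."""
    best = None
    for w in product(grid, repeat=len(candidate_fims)):
        if matches(w, candidate_fims, target_fim):
            if best is None or sum(w) < sum(best):
                best = w
    return best


# Hypothetical sensitivities d(prediction)/d(theta) for each candidate datum.
sensitivities = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (1.0, 1.0)}
candidate_fims = [outer(v) for v in sensitivities.values()]

# Hypothetical QoI: depends on theta_1 + theta_2, target standard deviation 0.5.
qoi_gradient = (1.0, 1.0)
qoi_sigma = 0.5
target_fim = [[x / qoi_sigma ** 2 for x in row] for row in outer(qoi_gradient)]

weights = select_weights(candidate_fims, target_fim, grid=range(9))
selected = [name for name, w in zip(sensitivities, weights) if w > 0]
print(dict(zip(sensitivities, weights)), selected)
```

In this toy setup only candidate `C` receives a nonzero weight, because its sensitivity points along the one parameter combination the QoI depends on. Measuring `A` and `B` together could also satisfy the condition, but only at a larger total weight. For realistic problem sizes the grid search would be replaced by a semidefinite programming solver.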

An information-matching approach to optimal experimental design and active learning
