Model averaging approaches to data subset selection

Ethan T. Neil, Jacob W. Sitison
DOI: https://doi.org/10.1103/PhysRevE.108.045308
2023-10-25
Abstract: Model averaging is a useful and robust method for dealing with model uncertainty in statistical analysis. Often, it is useful to consider data subset selection at the same time, in which model selection criteria are used to compare models across different subsets of the data. Two different criteria have been proposed in the literature for how the data subsets should be weighted. We compare the two criteria closely in a unified treatment based on the Kullback-Leibler divergence, and conclude that one of them is subtly flawed and will tend to yield larger uncertainties due to loss of information. Analytical and numerical examples are provided.
Methodology, High Energy Physics - Lattice
What problem does this paper attempt to address?
This paper examines how to select data subsets in statistical analysis when model averaging is used to handle model uncertainty. It compares two information criteria from the literature, AICsub and AICperf, which differ in how they weight fits to different subsets of the data. The core issues addressed are:

1. **Combining Model Averaging with Subset Selection**: How to account for the choice of data subset when model uncertainty is handled through model averaging.
2. **Evaluating Data Subsets**: Comparing the two previously proposed criteria (AICsub and AICperf) for weighting data subsets, in a unified treatment based on the Kullback-Leibler divergence.
3. **Identifying the Preferred Criterion**: Determining, through theoretical analysis and numerical experiments, which criterion avoids information loss and yields more reliable results.

The paper first introduces model averaging and the Akaike Information Criterion (AIC), then presents the two subset criteria and explains their derivations and theoretical basis. Analytical and numerical examples then expose the problem with AICsub: it tends to favor subsets with fewer data points, which inflates the uncertainty of the resulting estimates. AICperf retains more of the information in the data and therefore yields more precise estimates.

In summary, the comparison favors AICperf for model averaging over data subsets.
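To make the model-averaging step concrete, here is a minimal sketch of how information-criterion values are commonly turned into weights and a combined estimate. It assumes the standard Akaike-weight form, w_i proportional to exp(-IC_i / 2), and the usual mixture formula for the averaged variance. The summary above does not give the formulas for AICsub or AICperf, so the sketch takes criterion values as plain inputs; the function names and all numbers are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def akaike_weights(ic_values):
    """Normalized weights w_i proportional to exp(-IC_i / 2)."""
    ic = np.asarray(ic_values, dtype=float)
    # Subtracting the minimum avoids underflow; it cancels on normalization.
    rel = np.exp(-0.5 * (ic - ic.min()))
    return rel / rel.sum()


def model_average(estimates, errors, weights):
    """Weighted mean and total error over a set of candidate fits.

    The variance is the weighted statistical variance plus the
    weighted spread of the individual estimates around the mean.
    """
    a = np.asarray(estimates, dtype=float)
    s = np.asarray(errors, dtype=float)
    w = np.asarray(weights, dtype=float)
    mean = np.sum(w * a)
    var = np.sum(w * s**2) + np.sum(w * a**2) - mean**2
    return mean, np.sqrt(max(var, 0.0))


# Hypothetical criterion values, estimates, and errors for three fits
# (for example, one model fitted to three different data subsets).
ic = [10.0, 12.0, 20.0]
est = [1.00, 1.05, 0.90]
err = [0.02, 0.03, 0.05]

w = akaike_weights(ic)
mean, sigma = model_average(est, err, w)
print("weights:", w)
print("averaged estimate:", mean, "+/-", sigma)
```

Because the weights depend only on differences between criterion values, any change in how a criterion penalizes discarded data points shifts weight between subsets, which is where the two criteria compared in the paper would enter.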