Gabriel Missael Barco, Alexandre Adam, Connor Stone, Yashar Hezaveh, Laurence Perreault-Levasseur
Abstract: Bayesian inference for inverse problems hinges critically on the choice of priors. In the absence of specific prior information, population-level distributions can serve as effective priors for parameters of interest. With the advent of machine learning, the use of data-driven population-level distributions (encoded, e.g., in a trained deep neural network) as priors is emerging as an appealing alternative to simple parametric priors in a variety of inverse problems. However, in many astrophysical applications, it is often difficult or even impossible to acquire independent and identically distributed samples from the underlying data-generating process of interest to train these models. In these cases, corrupted data or a surrogate, e.g. a simulator, is often used to produce training samples, meaning that there is a risk of obtaining misspecified priors. This, in turn, can bias the inferred posteriors in ways that are difficult to quantify, which limits the potential applicability of these models in real-world scenarios. In this work, we propose addressing this issue by iteratively updating the population-level distributions by retraining the model with posterior samples from different sets of observations and showcase the potential of this method on the problem of background image reconstruction in strong gravitational lensing when score-based models are used as data-driven priors. We show that starting from a misspecified prior distribution, the updated distribution becomes progressively closer to the underlying population-level distribution, and the resulting posterior samples exhibit reduced bias after several updates.
Instrumentation and Methods for Astrophysics, Cosmology and Nongalactic Astrophysics, Machine Learning
What problem does this paper attempt to address?
### The problems the paper attempts to solve
The paper aims to solve the distribution shift problem that arises when data-driven prior distributions are used in inverse problems. Specifically, when independent and identically distributed samples cannot be obtained directly from the data-generating process of interest, corrupted data or a surrogate (such as a simulator) is typically used to generate training samples. This can lead to a misspecified prior, which in turn biases the inferred posterior distribution. These biases are difficult to quantify, which limits the applicability of such models in real-world scenarios.
To address this problem, the authors propose an iterative method that updates the population-level distribution by retraining the model. The steps are as follows:
1. **Initial prior distribution**: Train an initial prior distribution using a potentially corrupted dataset.
2. **Posterior sampling**: Generate posterior samples using the observed data.
3. **Update prior distribution**: Retrain the model using these posterior samples to update the prior distribution.
4. **Iterative process**: Repeat the above steps so that the prior distribution gradually approaches the true population-level distribution.
### Method overview
#### 2.1 Score-based Models
Score-based models (SBMs) are a class of generative models that learn the score function \(\nabla_x \log p_t(x)\) of the data distribution convolved with noise at level \(t\). A neural network \(s_\theta(x, t)\) is trained by minimizing the denoising score-matching objective:
\[ L_\theta = \mathbb{E}_{x \sim D, t \sim U(0,1), x_t \sim p(x_t | x)} \left[ \lambda(t) \| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t | x) \|^2 \right] \]
where \(\lambda(t)\) is a weight function and \(p(x_t | x)\) is a Gaussian perturbation kernel.
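The objective above can be estimated by Monte Carlo. The following is a minimal sketch, not the authors' implementation. It assumes a variance-exploding kernel \(p(x_t | x) = \mathcal{N}(x, \sigma(t)^2 I)\) with a geometric schedule for \(\sigma(t)\) and the weighting \(\lambda(t) = \sigma(t)^2\); these are common choices, but the summary does not state which ones the paper uses. The names `sigma`, `dsm_loss`, and `score_fn` are hypothetical.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Assumed geometric noise schedule (not taken from the paper)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def dsm_loss(score_fn, x, rng):
    """Monte Carlo estimate of the denoising score-matching loss.

    score_fn(x_t, t) plays the role of s_theta; x has shape (n, d).
    """
    n = x.shape[0]
    t = rng.uniform(0.0, 1.0, size=(n, 1))   # t ~ U(0, 1)
    s = sigma(t)                             # noise level, shape (n, 1)
    eps = rng.standard_normal(x.shape)
    x_t = x + s * eps                        # x_t ~ p(x_t | x)
    # Score of the Gaussian kernel: -(x_t - x) / sigma^2 = -eps / sigma
    target = -eps / s
    residual = score_fn(x_t, t) - target
    lam = s[:, 0] ** 2                       # assumed weighting lambda(t)
    return np.mean(lam * np.sum(residual ** 2, axis=1))

# Usage: for data drawn from N(0, I), the noised marginal is N(0, (1 + sigma^2) I),
# so its exact score is known and should beat a trivial zero score.
data = np.random.default_rng(0).standard_normal((4096, 2))
exact = lambda x_t, t: -x_t / (1.0 + sigma(t) ** 2)
trivial = lambda x_t, t: np.zeros_like(x_t)
loss_exact = dsm_loss(exact, data, np.random.default_rng(1))
loss_trivial = dsm_loss(trivial, data, np.random.default_rng(1))
```

In a real setting `score_fn` would be a neural network and the loss would be minimized by gradient descent; the closed-form score is used here only so that the comparison can be checked by hand.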
#### 2.2 Score prior in linear inverse problems
A linear inverse problem is described by the equation \(y = Ax + \eta\), where \(x\) is the parameter of interest, \(y\) is the observation, and \(\eta\) is additive noise. In Bayesian inference, the goal is to sample from the posterior distribution \(p(x | y)\). With a score model, this can be achieved by replacing the prior score function with the posterior score function, i.e., the prior score plus the score of the likelihood.
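As an illustration of the posterior score, Bayes' rule gives \(\nabla_x \log p(x | y) = \nabla_x \log p(x) + \nabla_x \log p(y | x)\). The sketch below checks this numerically in a case where everything is available in closed form. It assumes a standard normal prior and Gaussian noise \(\eta \sim \mathcal{N}(0, \sigma_\eta^2 I)\), and it ignores the diffusion time \(t\); at \(t > 0\) the likelihood term typically has to be approximated, which this example does not cover. All function names are hypothetical.

```python
import numpy as np

def prior_score(x):
    """Score of the assumed prior N(0, I)."""
    return -x

def likelihood_score(x, y, A, noise_std):
    """Gradient of log N(y; A x, noise_std^2 I) with respect to x."""
    return A.T @ (y - A @ x) / noise_std ** 2

def posterior_score(x, y, A, noise_std):
    """Posterior score as prior score plus likelihood score."""
    return prior_score(x) + likelihood_score(x, y, A, noise_std)

def analytic_posterior(y, A, noise_std):
    """Mean and precision of the Gaussian posterior for this toy model."""
    d = A.shape[1]
    precision = np.eye(d) + A.T @ A / noise_std ** 2
    mean = np.linalg.solve(precision, A.T @ y / noise_std ** 2)
    return mean, precision

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # underdetermined forward operator
noise_std = 0.5
x_true = rng.standard_normal(5)
y = A @ x_true + noise_std * rng.standard_normal(3)

x = rng.standard_normal(5)               # arbitrary evaluation point
mean, precision = analytic_posterior(y, A, noise_std)
closed_form = -precision @ (x - mean)    # score of N(mean, precision^-1)
matches = np.allclose(posterior_score(x, y, A, noise_std), closed_form)
```

With a learned prior, `prior_score` would be replaced by the trained network \(s_\theta\), while the likelihood term follows from the known forward model \(A\) and noise distribution.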
#### 2.3 Updating the prior from the observed data
Suppose the initial SBM prior is trained on a potentially corrupted dataset \(\{x_i^{(0)}\}\), and the goal is to update the population-level parameter \(\theta\) given a set of noisy or partially observed data \(\{y_i\}\). At iteration \(\alpha\), the method proceeds as follows:
1. **Generate posterior samples**: For each observation \(y_i\), generate \(K\) posterior samples \(\{x_{i,j}^{(\alpha)}\}\) under the current prior \(p_{\theta^{\alpha}}(x)\).
2. **Train new prior**: Use these posterior samples to train a new prior distribution \(p_{\theta^{\alpha+1}}(x)\).
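The two steps above can be sketched on a toy problem where both are available in closed form. This is an illustrative assumption, not the paper's setup: the prior is a one-dimensional Gaussian rather than a score-based model, the forward model is \(y = x + \eta\), "retraining" means refitting the mean and variance to the pooled posterior samples, and the same set of observations is reused at every iteration. The names `update_prior` and `run` are hypothetical.

```python
import numpy as np

def update_prior(y, prior_mean, prior_var, noise_var, k, rng):
    """One update: draw k posterior samples per observation, then refit the prior."""
    # Conjugate Gaussian posterior for y_i = x_i + noise under the current prior
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y / noise_var)
    samples = post_mean[:, None] + np.sqrt(post_var) * rng.standard_normal((y.size, k))
    # "Retrain" the prior on the pooled posterior samples
    return samples.mean(), samples.var()

def run(n_obs=2000, k=10, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    true_mean, true_var, noise_var = 3.0, 1.0, 1.0
    x = true_mean + np.sqrt(true_var) * rng.standard_normal(n_obs)
    y = x + np.sqrt(noise_var) * rng.standard_normal(n_obs)
    mean, var = 0.0, 1.0                 # misspecified initial prior
    history = [(mean, var)]
    for _ in range(n_iter):
        mean, var = update_prior(y, mean, var, noise_var, k, rng)
        history.append((mean, var))
    return history

history = run()
for alpha, (m, v) in enumerate(history):
    print(f"iteration {alpha}: prior mean = {m:.3f}, prior variance = {v:.3f}")
```

In this toy case each update moves the prior mean part of the way from its current value toward the mean of the observations, so the initial bias shrinks over iterations. Whether and how quickly a score-based prior behaves similarly on images is what the experiments examine.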
### Experiments and results
#### 4.1 MNIST: Pattern mismatch
In the experiment, the initial prior \(p_{\theta_0}(x)\) was trained on a subset of MNIST with digits 1 and 4 removed, while the true population distribution \(p_{\theta^\star}(x)\) contains data with digits 1 and 6 removed. After 4 iterations, the model successfully "forgot" digit 6 and gradually learned digit 4, even though this digit was not included in the initial training.
#### 4.2 Galaxies: Distribution shift in high-dimensional space
In a more realistic setting, the method was tested on recovering undistorted images of background galaxies in strong gravitational lensing. The initial prior \(p_{\theta_0}(x)\) was trained on an elliptical galaxy dataset, while the true prior \(p_{\theta^\star}(x)\) was trained on a spiral galaxy dataset. Over multiple iterations, the model discovered new features (such as spiral arms) and gradually approached the true distribution.