Abstract: We present a mathematical framework and computational methods to optimally design a finite number of sequential experiments. We formulate this sequential optimal experimental design (sOED) problem as a finite-horizon partially observable Markov decision process (POMDP) in a Bayesian setting and with information-theoretic utilities. It is built to accommodate continuous random variables, general non-Gaussian posteriors, and expensive nonlinear forward models. sOED then seeks an optimal design policy that incorporates elements of both feedback and lookahead, generalizing the suboptimal batch and greedy designs. We solve for the sOED policy numerically via policy gradient (PG) methods from reinforcement learning, and derive and prove the PG expression for sOED. Adopting an actor-critic approach, we parameterize the policy and value functions using deep neural networks and improve them using gradient estimates produced from simulated episodes of designs and observations. The overall PG-sOED method is validated on a linear-Gaussian benchmark, and its advantages over batch and greedy designs are demonstrated through a contaminant source inversion problem in a convection-diffusion field.
What problem does this paper attempt to address?
This paper addresses the optimal design of a sequence of experiments for nonlinear models. Specifically, it proposes a mathematical framework and computational methods for optimally designing a finite number of sequential experiments in a Bayesian setting. The problem is formulated as a finite-horizon partially observable Markov decision process (POMDP), with information-theoretic utilities measuring the value of the experiments. The framework accommodates continuous random variables, general non-Gaussian posteriors, and expensive nonlinear forward models. The main contributions of the paper are as follows:
1. **Problem Formulation**: Formulate the sequential optimal experimental design (sOED) problem as a finite-horizon POMDP in a Bayesian setting, applicable to continuous random variables, and show that it generalizes the batch and greedy designs.
2. **Algorithm Proposal**: Propose a policy gradient (PG) based sOED algorithm (called PG-sOED), derive and prove the key gradient expression, and propose its Monte Carlo estimator. In addition, introduce the deep neural network (DNN) architectures for the policy and value functions, and describe in detail the numerical setup of the entire method.
3. **Performance Verification**: Verify the speed and optimality advantages of PG-sOED on a linear-Gaussian benchmark and a contaminant source inversion problem in a convection-diffusion field, which involve expensive forward models.
### Main Contributions
1. **Problem Formulation**:
- Formulate the sOED problem as a finite-horizon POMDP in a Bayesian setting, applicable to continuous random variables.
- Show that sOED generalizes the batch and greedy designs.
2. **Algorithm Proposal**:
- Propose a policy gradient (PG) based sOED algorithm (PG-sOED).
- Derive and prove the key gradient expression and propose its Monte Carlo estimator.
- Use deep neural networks (DNNs) to parameterize and approximate the policy and value functions.
- Adopt an actor-critic approach that explicitly represents and learns the policy, thereby allowing the use of gradient-based optimization algorithms.
3. **Performance Verification**:
- Validate PG-sOED on a linear-Gaussian benchmark.
- Demonstrate the advantages of PG-sOED over batch and greedy designs through a contaminant source inversion problem in a convection-diffusion field.
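
To give a feel for how a policy can be improved from Monte Carlo gradient estimates, here is a minimal sketch of a generic score-function (REINFORCE-style) estimator on a toy one-step design problem. It is not the PG expression derived in the paper, and it uses a single scalar policy parameter rather than deep neural networks. The quadratic `reward`, the Gaussian policy, and all numerical settings are assumptions made purely for illustration.

```python
import numpy as np

def reward(d):
    # Hypothetical stand-in for an expected utility of design d, maximized at d = 2
    return -(d - 2.0) ** 2

def score_function_gradient(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E[reward(d)] for designs d ~ N(mu, sigma^2)."""
    d = rng.normal(mu, sigma, size=n_samples)
    r = reward(d)
    # Sample-mean baseline: reduces variance at the cost of a small O(1/n) bias
    baseline = r.mean()
    # Score of the Gaussian policy with respect to its mean: d/dmu log N(d; mu, sigma^2)
    score = (d - mu) / sigma**2
    return float(np.mean((r - baseline) * score))

rng = np.random.default_rng(0)
sigma = 1.0

# Single gradient estimate at mu = 0; the analytic value is -2 * (mu - 2) = 4
grad_at_zero = score_function_gradient(0.0, sigma, 200_000, rng)

# Plain stochastic gradient ascent on the policy mean
mu = 0.0
for _ in range(200):
    mu += 0.1 * score_function_gradient(mu, sigma, 2_000, rng)

print(f"gradient estimate at mu=0: {grad_at_zero:.3f}")
print(f"learned policy mean: {mu:.3f}")
```

In this toy problem the expected reward is \( -((\mu - 2)^2 + \sigma^2) \), so the learned mean should settle near 2. In an actor-critic method the sample-mean baseline would be replaced by a learned value function.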
### Mathematical Formulas
- **Bayesian Update Formula**:
\[
p(\theta | d_k, y_k, I_k) = \frac{p(y_k | \theta, d_k, I_k) p(\theta | I_k)}{p(y_k | d_k, I_k)}
\]
where \( I_k = \{d_0, y_0, \ldots, d_{k-1}, y_{k-1}\} \) is the record of all designs and observations before the \( k \)-th experiment.
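
As an illustration of this update, the following sketch applies it repeatedly on a one-dimensional grid. The observation model \( y = \theta d + \epsilon \) with Gaussian noise, the standard normal prior, and the design and observation values are all assumptions chosen so the result can be checked against the conjugate closed form. The function `bayes_update` is a hypothetical helper, not part of the paper's code.

```python
import numpy as np

def bayes_update(theta, prior, d, y, noise_std=1.0):
    """Posterior density on a uniform grid after observing y at design d."""
    # Log-likelihood of the assumed model y = theta * d + N(0, noise_std^2),
    # up to an additive constant that cancels in the normalization
    log_like = -0.5 * ((y - theta * d) / noise_std) ** 2
    unnorm = np.exp(log_like - log_like.max()) * prior
    # Normalizing by the integral plays the role of the evidence p(y_k | d_k, I_k)
    dtheta = theta[1] - theta[0]
    return unnorm / (unnorm.sum() * dtheta)

theta = np.linspace(-10.0, 10.0, 4001)
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)  # p(theta | I_0) = N(0, 1)

# Hypothetical (design, observation) pairs: the posterior of one step
# becomes the prior of the next
post = prior
for d, y in [(1.0, 0.8), (2.0, 1.5)]:
    post = bayes_update(theta, post, d, y)

post_mean = float(np.sum(theta * post) * dtheta)
post_var = float(np.sum((theta - post_mean) ** 2 * post) * dtheta)
print(f"posterior mean: {post_mean:.4f}, variance: {post_var:.4f}")
```

For this conjugate setup the exact posterior has precision \( 1 + 1^2 + 2^2 = 6 \) and mean \( (1 \cdot 0.8 + 2 \cdot 1.5)/6 \), which the grid result should reproduce closely. A grid is only practical in low dimensions.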
- **KL Divergence as a Reward Function**:
- Terminal Reward Form:
\[
g_N(x_N)=D_{\text{KL}}(p(\cdot | I_N)\|p(\cdot | I_0))=\int_\Theta p(\theta | I_N)\ln\left(\frac{p(\theta | I_N)}{p(\theta | I_0)}\right)d\theta
\]
- Incremental Reward Form:
\[
g_k(x_k, d_k, y_k)=D_{\text{KL}}(p(\cdot | I_{k + 1})\|p(\cdot | I_k))=\int_\Theta p(\theta | I_{k + 1})\ln\left(\frac{p(\theta | I_{k + 1})}{p(\theta | I_k)}\right)d\theta
\]
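
Both reward forms can be evaluated numerically once the relevant densities are available. The sketch below does this with a Riemann sum on a grid. The three Gaussian belief states (prior, intermediate posterior, final posterior) are hypothetical values chosen so that each result can be compared with the closed-form KL divergence between Gaussians. The helper names are likewise assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(theta, mean, std):
    return np.exp(-0.5 * ((theta - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_divergence(p, q, dtheta):
    """Riemann-sum approximation of D_KL(p || q) on a uniform grid."""
    mask = p > 0.0  # convention: 0 * ln(0) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dtheta)

def kl_gaussians(m1, s1, m0, s0):
    """Closed-form D_KL(N(m1, s1^2) || N(m0, s0^2)), used here as a check."""
    return float(np.log(s0 / s1) + (s1**2 + (m1 - m0) ** 2) / (2.0 * s0**2) - 0.5)

theta = np.linspace(-10.0, 10.0, 4001)
dtheta = theta[1] - theta[0]

# Hypothetical belief states p(theta | I_0), p(theta | I_1), p(theta | I_2) as (mean, std)
beliefs = [(0.0, 1.0), (0.4, np.sqrt(0.5)), (3.8 / 6.0, np.sqrt(1.0 / 6.0))]
densities = [gaussian_pdf(theta, m, s) for m, s in beliefs]

# Terminal form: a single reward comparing the final posterior with the prior
terminal_reward = kl_divergence(densities[-1], densities[0], dtheta)

# Incremental form: one reward per experiment, comparing consecutive posteriors
incremental_rewards = [
    kl_divergence(densities[k + 1], densities[k], dtheta)
    for k in range(len(densities) - 1)
]

print(f"terminal reward g_N: {terminal_reward:.4f}")
print("incremental rewards g_k:", [round(g, 4) for g in incremental_rewards])
```

For non-Gaussian posteriors no closed form exists in general, so the integral has to be approximated numerically as in `kl_divergence`.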
- **Policy Gradient Expression**:
\[