Online Posterior Sampling with a Diffusion Prior

Branislav Kveton, Boris Oreshkin, Youngsuk Park, Aniket Deshmukh, Rui Song
2024-10-05
Abstract: Posterior sampling in contextual bandits with a Gaussian prior can be implemented exactly or approximately using the Laplace approximation. The Gaussian prior is computationally efficient, but it cannot describe complex distributions. In this work, we propose approximate posterior sampling algorithms for contextual bandits with a diffusion model prior. The key idea is to sample from a chain of approximate conditional posteriors, one for each stage of the reverse process, which are estimated in a closed form using the Laplace approximation. Our approximations are motivated by posterior sampling with a Gaussian prior, and inherit its simplicity and efficiency. They are asymptotically consistent and perform well empirically on a variety of contextual bandit problems.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use a diffusion model as the prior for approximate posterior sampling in contextual bandits. A Gaussian prior is computationally efficient, but it cannot describe complex distributions, in particular multimodal ones. The paper therefore proposes posterior sampling algorithms with a diffusion model prior, which overcome this limitation of the Gaussian prior and perform well on a variety of contextual bandit problems.

### Main Contributions:
1. **Algorithm Innovation**: The paper proposes posterior sampling approximations for linear models and generalized linear models (GLMs). They sample from a chain of approximate conditional posteriors, one for each stage of the reverse process, and each conditional posterior is estimated in closed form using the Laplace approximation.
2. **Theoretical Contribution**: The paper proves that the proposed posterior approximation is asymptotically consistent. Specifically, as the number of observations increases, the conditional posterior concentrates on a scaled version of the unknown model parameter.
3. **Experimental Verification**: Experiments on a variety of contextual bandit problems demonstrate the effectiveness and robustness of the proposed method, which outperforms existing methods especially when the prior is complex.

### Method Overview:
- **Linear Model**: At each stage, the conditional posterior is a product of two Gaussian distributions, one representing prior knowledge and the other representing diffused evidence.
- **Generalized Linear Model**: The Laplace approximation combines prior knowledge and evidence to obtain the conditional posterior at each stage.
- **Diffusion Model**: The diffusion model serves as the prior. Its forward process adds noise to the model parameter, and its reverse process is used to generate samples.
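The product of two Gaussians mentioned for the linear model has a simple closed form, which the following minimal sketch illustrates. It is not the paper's diffusion-prior algorithm: it covers only the plain Gaussian-prior case that the abstract describes as exactly implementable, with a standard normal prior standing in for the prior-knowledge term. The function names `gaussian_product` and `linear_evidence`, the noise level `sigma`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_product(mu_a, cov_a, mu_b, cov_b):
    """Mean and covariance of the renormalized product of two Gaussian densities."""
    prec_a = np.linalg.inv(cov_a)
    prec_b = np.linalg.inv(cov_b)
    cov = np.linalg.inv(prec_a + prec_b)          # precisions add
    mu = cov @ (prec_a @ mu_a + prec_b @ mu_b)    # precision-weighted mean
    return mu, cov

def linear_evidence(X, y, sigma):
    """Gaussian summary of the likelihood of y = X @ theta + N(0, sigma^2) noise.

    Assumes X has full column rank so that X^T X is invertible.
    """
    prec = X.T @ X / sigma**2
    cov = np.linalg.inv(prec)
    mu = cov @ (X.T @ y) / sigma**2               # least-squares estimate
    return mu, cov

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5])
sigma = 0.5
X = rng.normal(size=(50, 2))
y = X @ theta_true + sigma * rng.normal(size=50)

# Prior knowledge: standard normal. Evidence: Gaussian summary of the data.
prior_mu, prior_cov = np.zeros(2), np.eye(2)
ev_mu, ev_cov = linear_evidence(X, y, sigma)

# The posterior is the product of the prior and evidence Gaussians.
post_mu, post_cov = gaussian_product(prior_mu, prior_cov, ev_mu, ev_cov)
post_cov = (post_cov + post_cov.T) / 2            # remove numerical asymmetry

# One posterior sample, as a Thompson-sampling style bandit step would draw.
theta_sample = rng.multivariate_normal(post_mu, post_cov)
print("posterior mean:", post_mu)
print("posterior sample:", theta_sample)
```

Because precisions add in the product, the evidence term dominates as more observations arrive and the prior matters less, which is the intuition behind posterior concentration in this simple setting.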
### Experimental Results:
- **Synthetic Experiments**: On three synthetic problems, the proposed method (DiffTS) outperforms the baseline methods, especially when the prior is multimodal.
- **MovieLens Experiments**: On the recommendation task, the proposed method attains significantly lower cumulative regret in both linear and logistic bandit problems.

### Related Work:
- **Multi-armed Bandits with Diffusion Models**: Hsieh et al. proposed a Thompson sampling method for K-armed bandits. This paper extends the setting to the more general contextual bandits and develops new posterior sampling approximations.
- **Posterior Sampling in Diffusion Models**: Chung et al. proposed the DPS method, which samples from the posterior by adding the gradient of the observation likelihood to the reverse process. This method becomes unstable as the number of observations increases. The method in this paper avoids the problem by approximating each conditional posterior as a product of prior and evidence distributions.

In conclusion, this paper proposes a posterior sampling method with a diffusion model prior, proves that it is asymptotically consistent, and verifies its effectiveness and robustness experimentally.