Discrete Probabilistic Inference as Control in Multi-path Environments

Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua Bengio
2024-05-28
Abstract: We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of sampling from discrete, structured distributions. Specifically, the authors treat sampling as a sequential decision problem in a multi-path environment, where the same object can be generated by several different sequences of actions, and look for a stochastic policy that samples objects at the end of this process with probability proportional to a predefined reward.

### Detailed Explanation:

1. **Problem Background**:
   - In deep learning and reinforcement learning, sampling is the standard way to generate data points from complex distributions. In discrete and highly structured sample spaces, however, reparameterization techniques are hard to apply because they usually require a continuous relaxation of the discrete distribution.
   - Another common approach is Markov Chain Monte Carlo (MCMC), which can sample from a target distribution known only up to an intractable normalization constant, but it can mix slowly in large, multimodal discrete spaces.
2. **Limitations of Existing Methods**:
   - Maximum Entropy Reinforcement Learning (MaxEnt RL) can be used to sample from some distributions, but when there are multiple ways to generate the same object, the state distribution induced by the optimal policy may be biased.
   - Generative Flow Networks (GFlowNets) are a family of probabilistic models that address this issue by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP), so that objects are sampled with probability proportional to their reward.
3. **Contributions of the Paper**:
   - **Correcting the Reward Function**: The paper extends recent reward-correction methods so that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP.
   - **Proof of Equivalence**: The authors prove that some flow-matching objectives in the GFlowNet literature are equivalent to well-established MaxEnt RL algorithms with a corrected reward.
   - **Experimental Verification**: The performance of multiple MaxEnt RL and GFlowNet algorithms is studied empirically on several problems that involve sampling from discrete distributions.

### Formula Summary:

- **Gibbs Distribution**:
  \[ P(x)\propto\exp\left(-\frac{E(x)}{\alpha}\right) \]
  where \(E(x)\) is the energy function and \(\alpha > 0\) is the temperature parameter.
- **Objective of Maximum Entropy Reinforcement Learning**:
  \[ \pi^*_{\text{MaxEnt}}=\arg\max_{\pi}\mathbb{E}_{\tau}\left[\sum_{t = 0}^{T}\Big(r(s_t,s_{t + 1})+\alpha H(\pi(\cdot|s_t))\Big)\right] \]
  where \(H(\pi(\cdot|s_t))\) is the entropy of the policy at state \(s_t\).
- **Corrected Reward Function**: the rewards along any trajectory must sum to
  \[ \sum_{t = 0}^{T}r(s_t,s_{t + 1})=-E(s_T)+\alpha\sum_{t = 0}^{T - 1}\log P_B(s_t|s_{t + 1}) \]
  where \(P_B(s_t|s_{t + 1})\) is a backward transition probability, that is, a distribution over the parents of \(s_{t + 1}\).

With this correction, the paper aims to provide an unbiased way to sample from complex discrete, structured distributions.
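
The effect of the corrected reward can be checked by hand on a very small example. The sketch below is not taken from the paper: the five-state DAG, the choice of a uniform backward policy for \(P_B\), and all function names are assumptions made for illustration. It also pays the energy term on the transition that enters an object, rather than on a separate terminating transition, which leaves the sum of rewards along a trajectory unchanged. The optimal MaxEnt RL policy is computed with soft value iteration, which is exact here because the MDP is a deterministic DAG.

```python
import math
from collections import defaultdict

# Hypothetical toy DAG: object "x" is reachable by two paths, "y" by one.
CHILDREN = {"s0": ["a", "b"], "a": ["x"], "b": ["x", "y"], "x": [], "y": []}
ENERGY = {"x": 0.0, "y": 0.0}  # equal energies, so the target is uniform
ALPHA = 1.0                    # temperature

# Invert the edge list to get the parents of each state.
PARENTS = defaultdict(list)
for state, kids in CHILDREN.items():
    for kid in kids:
        PARENTS[kid].append(state)


def reward(s, s_next, corrected):
    """Reward of the transition s -> s_next.

    The energy is paid when entering an object. With corrected=True, the term
    alpha * log P_B(s | s_next) is added, with P_B uniform over parents
    (an assumption of this sketch).
    """
    r = -ENERGY[s_next] if s_next in ENERGY else 0.0
    if corrected:
        r += ALPHA * math.log(1.0 / len(PARENTS[s_next]))
    return r


def soft_value(s, corrected, cache):
    """Soft value V(s) = alpha * log sum_c exp((r(s, c) + V(c)) / alpha)."""
    if not CHILDREN[s]:
        return 0.0
    if s not in cache:
        cache[s] = ALPHA * math.log(sum(
            math.exp((reward(s, c, corrected) + soft_value(c, corrected, cache)) / ALPHA)
            for c in CHILDREN[s]
        ))
    return cache[s]


def terminal_marginal(corrected):
    """Distribution over objects induced by the optimal MaxEnt RL policy."""
    cache = {}
    marginal = defaultdict(float)

    def walk(s, prob):
        if not CHILDREN[s]:
            marginal[s] += prob
            return
        v = soft_value(s, corrected, cache)
        for c in CHILDREN[s]:
            q = reward(s, c, corrected) + soft_value(c, corrected, cache)
            # Optimal policy: pi(c | s) = exp((Q(s, c) - V(s)) / alpha)
            walk(c, prob * math.exp((q - v) / ALPHA))

    walk("s0", 1.0)
    return dict(marginal)


for flag in (False, True):
    dist = terminal_marginal(corrected=flag)
    print("corrected" if flag else "uncorrected",
          {k: round(p, 3) for k, p in dist.items()})
# uncorrected {'x': 0.667, 'y': 0.333}
# corrected {'x': 0.5, 'y': 0.5}
```

Without the correction, "x" is sampled twice as often as "y" simply because two trajectories lead to it, even though both objects have the same energy. With the correction, the backward probabilities of all trajectories ending in the same object sum to one, so the extra paths cancel and the marginal matches the Gibbs distribution.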