Abstract: We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

What problem does this paper attempt to address?

The paper addresses the problem of sampling from discrete, structured distributions. Specifically, the authors treat sampling as a sequential decision-making process in a multi-path environment, where the same object can be generated by several different trajectories, and ask for a stochastic policy that samples objects at the end of the process proportionally to a predefined reward.

### Detailed Explanation:

1. **Problem Background**:
   - In deep learning and reinforcement learning, sampling is an important way of generating data points from complex distributions. In discrete and highly structured sample spaces, however, traditional reparameterization techniques are difficult to apply, because they usually require a continuous relaxation of the discrete distribution.
   - Another common approach is Markov Chain Monte Carlo (MCMC), which applies even when the target distribution has an intractable normalization constant.
2. **Limitations of Existing Methods**:
   - Maximum Entropy Reinforcement Learning (MaxEnt RL) can be used to sample from some distributions, but when there are multiple ways to generate the same object, the state distribution induced by the optimal policy may be biased.
   - Generative Flow Networks (GFlowNets) are a class of probabilistic models that address this issue by approximately enforcing flow conservation across the MDP, so that objects are sampled proportionally to their reward.
3. **Contributions of the Paper**:
   - **Correcting the Reward Function**: The paper extends recent methods that correct the reward function, guaranteeing that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP.
   - **Proof of Equivalence**: The authors prove that some flow-matching objectives in the GFlowNet literature are equivalent to well-established MaxEnt RL algorithms with a corrected reward.
   - **Experimental Verification**: The performance of multiple MaxEnt RL and GFlowNet algorithms is studied empirically on multiple problems involving sampling from discrete distributions.

### Formula Summary:

- **Gibbs Distribution**:
  \[ P(x)\propto\exp\left(-\frac{E(x)}{\alpha}\right) \]
  where \(E(x)\) is the energy function and \(\alpha > 0\) is the temperature parameter.
- **Objective of Maximum Entropy Reinforcement Learning**:
  \[ \pi^*_{\text{MaxEnt}}=\arg\max_{\pi}\mathbb{E}_{\tau}\left[\sum_{t = 0}^{T}\Big(r(s_t,s_{t + 1})+\alpha H(\pi(\cdot|s_t))\Big)\right] \]
  where \(H\) denotes the entropy of the policy at state \(s_t\).
- **Corrected Reward Function**:
  \[ \sum_{t = 0}^{T}r(s_t,s_{t + 1})=-E(s_T)+\alpha\sum_{t = 0}^{T - 1}\log P_B(s_t|s_{t + 1}) \]
  where \(P_B\) is a backward transition probability.

With this correction, the paper aims to provide an unbiased way of sampling from complex, discrete, structured distributions using MaxEnt RL.
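The multi-path bias and the effect of the corrected reward can be illustrated with a minimal sketch in Python. Everything specific here is an assumption made for illustration and does not come from the paper: the toy DAG, the state names (`s0`, `a`, `b`, `x`, `y`, `sf`), the equal energies, the temperature `ALPHA = 1`, and the choice of a uniform backward policy for \(P_B\). The sketch computes the optimal MaxEnt policy by soft value iteration and then the distribution over terminating objects, once with the plain reward \(-E(s_T)\) and once with the \(\alpha\log P_B\) correction added.

```python
import math
from collections import defaultdict

ALPHA = 1.0  # temperature (assumed value)

# Hypothetical toy DAG: object "x" is reachable by two trajectories
# (s0 -> a -> x and s0 -> b -> x), object "y" by only one (s0 -> a -> y).
CHILDREN = {
    "s0": ["a", "b"],
    "a": ["x", "y"],
    "b": ["x"],
    "x": ["sf"],  # "sf" is the terminal state
    "y": ["sf"],
}
ENERGY = {"x": 0.0, "y": 0.0}        # equal energies: the target is uniform
ORDER = ["s0", "a", "b", "x", "y"]   # a topological order of the states

PARENTS = defaultdict(list)
for state, kids in CHILDREN.items():
    for kid in kids:
        PARENTS[kid].append(state)


def log_pb(s, s_next):
    """Log-probability of a uniform backward policy P_B(s | s_next) (assumed)."""
    return -math.log(len(PARENTS[s_next]))


def reward(s, s_next, corrected):
    """-E(s) on the terminating transition, plus alpha * log P_B elsewhere."""
    if s_next == "sf":
        return -ENERGY[s]
    return ALPHA * log_pb(s, s_next) if corrected else 0.0


def terminating_distribution(corrected):
    """Distribution over objects under the optimal MaxEnt policy."""
    # Soft value iteration, processed backwards through the DAG.
    value = {"sf": 0.0}
    policy = {}
    for s in reversed(ORDER):
        q = {c: reward(s, c, corrected) + value[c] for c in CHILDREN[s]}
        value[s] = ALPHA * math.log(sum(math.exp(v / ALPHA) for v in q.values()))
        policy[s] = {c: math.exp((v - value[s]) / ALPHA) for c, v in q.items()}
    # Forward pass: probability of visiting each state starting from s0.
    reach = defaultdict(float)
    reach["s0"] = 1.0
    for s in ORDER:
        for c, p in policy[s].items():
            reach[c] += reach[s] * p
    return {obj: reach[obj] for obj in ENERGY}


biased = terminating_distribution(corrected=False)
unbiased = terminating_distribution(corrected=True)
print("plain reward:    ", biased)
print("corrected reward:", unbiased)
```

In this toy setting the plain reward over-samples `x`, because two trajectories lead to it, while the corrected reward recovers the uniform target over `x` and `y`. A uniform \(P_B\) is only one possible choice for the sketch.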

Discrete Probabilistic Inference as Control in Multi-path Environments

Stochastic Generative Flow Networks

Generative Flow Networks as Entropy-Regularized RL

Generative Flow Networks: a Markov Chain Perspective

Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

Rectifying Reinforcement Learning for Reward Matching

Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization

Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective

Discrete Flow Matching

Sampling with flows, diffusion, and autoregressive neural networks from a spin-glass perspective

Generative Assignment Flows for Representing and Learning Joint Distributions of Discrete Data

Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization

Bifurcated Generative Flow Networks

On Generalization for Generative Flow Networks

Learning GFlowNets from partial episodes for improved convergence and stability

GFlowNets for AI-Driven Scientific Discovery

Streaming Bayes GFlowNets

Amortizing intractable inference in diffusion models for vision, language, and control

Better Training of GFlowNets with Local Credit and Incomplete Trajectories

Control as Probabilistic Inference as an Emergent Communication Mechanism in Multi-Agent Reinforcement Learning

Embarrassingly Parallel GFlowNets