Abstract: Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.
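To make the memoryless noise schedule concrete, here is a minimal sketch for Flow Matching. It assumes the convention \( X_t = \beta_t X_0 + \alpha_t X_1 \) (Gaussian noise \( X_0 \) at \( t=0 \), data \( X_1 \) at \( t=1 \)), the memoryless schedule \( \sigma(t)=\sqrt{2\eta_t} \), and, as our own assumption for illustration, \( \eta_t = \beta_t\left(\frac{\dot\alpha_t}{\alpha_t}\beta_t - \dot\beta_t\right) \). The helper `memoryless_sigma` and the `linear` schedule are hypothetical names, not an API from the paper.

```python
import math


def memoryless_sigma(t, alpha, beta, dalpha, dbeta):
    """Memoryless noise level sigma(t) = sqrt(2 * eta_t), valid for 0 < t <= 1.

    Assumes X_t = beta_t * X_0 + alpha_t * X_1 (X_0 = Gaussian noise, X_1 = data)
    and the illustrative (assumed) definition
        eta_t = beta_t * (dalpha_t / alpha_t * beta_t - dbeta_t).
    alpha, beta, dalpha, dbeta are callables: the schedule and its time derivatives.
    """
    a, b = alpha(t), beta(t)
    eta = b * (dalpha(t) / a * b - dbeta(t))
    return math.sqrt(2.0 * eta)


# Linear schedule alpha_t = t, beta_t = 1 - t, for which eta_t reduces to (1 - t) / t.
linear = dict(
    alpha=lambda t: t,
    beta=lambda t: 1.0 - t,
    dalpha=lambda t: 1.0,
    dbeta=lambda t: -1.0,
)

for t in (1e-4, 0.25, 0.5, 0.9, 1.0):
    # Compare the generic formula against the closed form sqrt(2 * (1 - t) / t).
    closed = math.sqrt(2 * (1 - t) / t)
    print(f"t={t:<6}  sigma={memoryless_sigma(t, **linear):.4f}  closed form={closed:.4f}")
```

For the linear schedule this gives \( \sigma(t)=\sqrt{2(1-t)/t} \): the noise level diverges as \( t \to 0 \) and vanishes at \( t = 1 \). Informally, the very large noise injected at the start is what lets the process forget \( X_0 \).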

What problem does this paper attempt to address?

The main problem this paper addresses is how to improve dynamical generative models (such as Flow Matching and denoising diffusion models) through reward fine-tuning, so that the samples they generate better match the desired quality and realism. Specifically, the authors aim to fix theoretical flaws in existing methods, in particular the value function bias problem, and propose a new algorithm, Adjoint Matching, that achieves better consistency, realism, and generalization.

### Detailed Interpretation

#### Research Background

Dynamical generative models (such as Flow Matching and denoising diffusion models) are widely used in generative modeling applications, including text-to-image, text-to-video, and text-to-audio tasks. However, these base models often fall short of the desired sample quality. A common remedy is to fine-tune them with human-preference reward models, but theoretically-sound methods for doing so are rare.

#### Main Challenges

1. **Value Function Bias Problem**: When existing fine-tuning methods add KL regularization, the fine-tuned distribution deviates from the target tilted distribution \( p^*(x)\propto p_{\text{base}}(x)\exp(r(x)) \). The bias comes from the dependence between the initial noise variable \( X_0 \) and the generated sample \( X_1 \).
2. **Lack of a Unified Framework**: Different types of dynamical generative models (such as Flow Matching and denoising diffusion models) lack a unified fine-tuning framework.

#### Solutions

1. **Memoryless Noise Schedule**:
   - The authors prove that a specific memoryless noise schedule removes the dependence between \( X_0 \) and \( X_1 \), which ensures that the fine-tuned samples converge to the target tilted distribution.
   - Specifically, the memoryless noise schedule \( \sigma(t)=\sqrt{2\eta_t} \) makes the generation process satisfy the independence condition \( p_{\text{base}}(X_0, X_1)=p_{\text{base}}(X_0)\,p_{\text{base}}(X_1) \), thus avoiding the value function bias.
2. **Adjoint Matching Algorithm**:
   - A new algorithm, Adjoint Matching, is proposed to solve the stochastic optimal control problem.
   - It combines the continuous adjoint method with a least-squares regression objective, which makes it simple and scalable.
   - Adjoint Matching applies to general stochastic optimal control problems and performs well in experiments, significantly improving the quality of generated samples while retaining their diversity.

#### Experimental Results

- The authors ran extensive experimental comparisons and evaluated the methods from several perspectives (such as realism, consistency, and diversity).
- The results show that Adjoint Matching not only provides better text-to-sample consistency but also generalizes well to unseen human-preference reward models, while retaining sample diversity.

### Summary

This paper resolves the value function bias problem in reward fine-tuning of dynamical generative models by introducing the memoryless noise schedule and the Adjoint Matching algorithm, yielding a fine-tuning method that is both theoretically sound and effective in practice. It offers new ideas and tools for improving the quality and realism of generative models.
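The Adjoint Matching recipe described above, a continuous adjoint computation paired with a least-squares regression objective, can be illustrated on a toy one-dimensional SOC problem. This is a hypothetical sketch, not the paper's implementation. It assumes controlled dynamics \( dX_t = (b(X_t,t) + \sigma u(X_t,t))\,dt + \sigma\,dB_t \), the cost \( \mathbb{E}\left[\int_0^1 \tfrac12 u^2\,dt - r(X_1)\right] \), and a "lean" adjoint \( \tilde a \) solving \( \frac{d}{dt}\tilde a(t) = -\tilde a(t)\,\nabla_x b(X_t,t) \) backward from \( \tilde a(1) = -\nabla r(X_1) \), with the control regressed onto \( -\sigma\,\tilde a(t) \). The base drift \( b(x,t) = -\theta x \), the linear reward \( r(x) = c\,x \), the constants, and all function names are made up for illustration. The toy only exercises the SOC solver; it does not involve the memoryless noise schedule.

```python
import math
import random

# Hypothetical toy SOC problem (illustration only, not the paper's setup):
#   controlled dynamics  dX = (b(X, t) + SIGMA * u(X, t)) dt + SIGMA dB,  X_0 ~ N(0, 1)
#   cost                 E[ int_0^1 0.5 * u^2 dt - r(X_1) ]
#   base drift b(x, t) = -THETA * x,  reward r(x) = C * x
THETA, SIGMA, C = 1.0, 0.8, 2.0
N_STEPS, N_TRAJ = 200, 64
H = 1.0 / N_STEPS


def b(x, t):
    return -THETA * x


def grad_b(x, t):
    # Derivative of the base drift with respect to x.
    return -THETA


def grad_r(x):
    # Derivative of the (linear) reward with respect to x.
    return C


def sample_trajectory(u, rng):
    """Euler-Maruyama rollout of the controlled SDE under the current control u(x, t)."""
    xs = [rng.gauss(0.0, 1.0)]
    for k in range(N_STEPS):
        t, x = k * H, xs[-1]
        drift = b(x, t) + SIGMA * u(x, t)
        xs.append(x + drift * H + SIGMA * math.sqrt(H) * rng.gauss(0.0, 1.0))
    return xs


def lean_adjoint(xs):
    """Assumed lean adjoint: da/dt = -a * grad_b(X_t, t), solved backward from a(1) = -grad_r(X_1)."""
    a = [0.0] * (N_STEPS + 1)
    a[N_STEPS] = -grad_r(xs[N_STEPS])
    for k in range(N_STEPS, 0, -1):
        # Explicit Euler step backward in time: a(t - H) ~= a(t) + H * a(t) * grad_b(X_t, t).
        a[k - 1] = a[k] + H * a[k] * grad_b(xs[k], k * H)
    return a


def fit_control(u, rng):
    """One regression round: least-squares fit of u(t_k) onto the targets -SIGMA * a(t_k)."""
    targets = [[] for _ in range(N_STEPS + 1)]
    for _ in range(N_TRAJ):
        xs = sample_trajectory(u, rng)
        a = lean_adjoint(xs)  # treated as a fixed regression target
        for k in range(N_STEPS + 1):
            targets[k].append(-SIGMA * a[k])
    # For a control that depends only on t, the least-squares fit at each step is the mean target.
    return [sum(v) / len(v) for v in targets]


def optimal_u(t):
    # Closed-form optimal control for this linear toy problem, used only as a check.
    return SIGMA * C * math.exp(-THETA * (1.0 - t))


rng = random.Random(0)
table = fit_control(lambda x, t: 0.0, rng)  # start from the base process (zero control)
for k in (0, N_STEPS // 2, N_STEPS):
    print(f"t={k * H:.2f}  fitted u={table[k]:.4f}  closed form={optimal_u(k * H):.4f}")
```

Because the reward here is linear, the regression target does not depend on the sampled trajectories, so a single regression round already matches the closed-form optimum \( u^*(t) = \sigma c\, e^{-\theta(1-t)} \) up to discretization error. With a nonlinear reward and a neural-network control, we would expect to alternate between sampling with the current control and refitting.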

Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control
