Prompt Mixing in Diffusion Models using the Black Scholes Algorithm

Divya Kothandaraman, Ming Lin, Dinesh Manocha
2024-05-22
Abstract: We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts using pre-trained text-to-image diffusion models. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. To do so, we leverage the connection between diffusion models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model for pricing options in finance, and draw analogies between the variables in both contexts to derive an appropriate algorithm for prompt mixing using the Black-Scholes model. Specifically, the parallels between diffusion models and the Black-Scholes model enable us to leverage properties related to the dynamics of the Markovian model derived in the Black-Scholes algorithm. Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it qualitatively and quantitatively to other prompt mixing techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection, across various scenarios such as single object per text prompt, multiple objects per text prompt, and objects against backgrounds. Code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses prompt mixing in text-to-image diffusion models: generating a single image that conforms to several text prompts at once. The authors propose a method that uses the Black-Scholes model from finance to dynamically select the most relevant text prompt at each denoising step, so that the generated image balances the features of all the given prompts.

### Main Contributions:

1. **Algorithm Design**: A new prompt mixing algorithm that evaluates a "cost" for each text prompt at every step of the diffusion denoising process and conditions on the prompt that maximizes this cost.
2. **Theoretical Foundation**: Diffusion models are linked to the Black-Scholes model from finance, and the shared notion of a Markovian time series is used to predict a score for each text prompt.
3. **Performance Validation**: Qualitative and quantitative experiments demonstrate the advantages of the method over several baselines (such as linear interpolation, alternating prompts, and step-wise prompt switching).

### Experimental Results:

- The method is validated across several experimental settings (single object per prompt, multiple objects per prompt, and objects against backgrounds).
- Across these settings, the method outperforms the other methods on both the CLIP-combined and CLIP-add scores.

### Future Work Directions:

- Explore more advanced image-language models to improve the evaluation metrics.
- Study prompt mixing with a larger number of text prompts.
- Extend the method to non-Gaussian diffusion models or one-step diffusion models.
In summary, this paper aims to improve the performance of existing text-to-image diffusion models in the task of prompt mixing by introducing the Black-Scholes model, thereby generating high-quality images that better conform to multiple text descriptions.
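The per-step selection described above can be illustrated with a minimal Python sketch. The `black_scholes_call` function is the standard closed-form Black-Scholes price of a European call option. Everything else is an assumption made for illustration and is not taken from the paper: the `select_prompt` helper, the use of a per-prompt alignment score as the asset price, a per-prompt volatility, the fraction of remaining denoising steps as time to expiry, and the default values of `k` and `r` are all hypothetical.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(s: float, k: float, r: float, sigma: float, tau: float) -> float:
    """Standard Black-Scholes price of a European call option.

    s: current asset price (> 0), k: strike (> 0), r: risk-free rate,
    sigma: volatility (> 0), tau: time remaining until expiry.
    """
    if tau <= 0.0:
        return max(s - k, 0.0)  # payoff at expiry
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)


def select_prompt(scores, sigmas, steps_left, total_steps, k=1.0, r=0.05):
    """Return the index of the prompt with the highest Black-Scholes value.

    ASSUMPTION (hypothetical mapping, not the paper's): `scores` are positive
    per-prompt alignment scores playing the role of the asset price, `sigmas`
    are per-prompt volatilities, and the fraction of denoising steps remaining
    plays the role of time to expiry.
    """
    tau = steps_left / total_steps
    values = [black_scholes_call(s, k, r, sg, tau) for s, sg in zip(scores, sigmas)]
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Early in denoising (all steps remaining) versus the final step.
    print(select_prompt([0.9, 0.95], [0.8, 0.1], steps_left=50, total_steps=50))
    print(select_prompt([1.1, 1.3], [0.8, 0.1], steps_left=0, total_steps=50))
```

Under these assumptions the choice depends on more than the current score: with many steps remaining, a prompt whose score is volatile can be valued above one with a slightly higher but stable score, whereas at the final step only the current score relative to the strike matters.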