DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

Xiaoxiao He, Ligong Han, Quan Dao, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, Bo Liu, Kang Li, Hongdong Li, Junzhou Huang, Faez Ahmed, Akash Srivastava, Dimitris Metaxas
2024-10-11
Abstract: Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. Project webpage: https://hexiaoxiao-cs.github.io/DICE/
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the limited support for controllable content editing in discrete diffusion models. These models have succeeded at image generation and masked language modeling, but editing that requires fine control remains difficult. For example, when a masked generative model edits an image by inpainting a masked area, it cannot inject information from the masked region into the inpainting process, so control over the edited result is coarse.

To address this, the paper proposes DICE (Discrete Inversion for Controllable Editing), which it describes as the first precise inversion algorithm for discrete diffusion models, covering both multinomial diffusion and masked generative models. DICE records the noise sequences and masking patterns needed to recover a stochastic trajectory through the reverse diffusion process. During editing or inference, it adds these recorded residuals back, which makes it possible to inject information from the original input and to control how much of it is introduced.

As a result, DICE can reconstruct the original input accurately and edit it controllably without predefined masks or attention manipulation. According to the paper, this preserves high data fidelity while enhancing editing capabilities, and it provides a flexible framework for fine-grained content manipulation in discrete spaces.
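The record-and-replay idea can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes residuals are stored as the gap between a one-hot score for the recorded token and the model's logits, and it replaces the denoiser with a random stand-in. The names `toy_logits`, `one_hot_scores`, `invert`, `replay`, and the `strength` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def toy_logits(tokens, t, cond, vocab_size):
    """Hypothetical stand-in for a denoiser: per-position logits at step t."""
    rng = np.random.default_rng(1000 * cond + t)
    logits = rng.normal(size=(len(tokens), vocab_size))
    logits[np.arange(len(tokens)), tokens] += 1.0  # mild preference for current token
    return logits

def one_hot_scores(tokens, vocab_size, scale=10.0):
    """Score matrix whose argmax along the last axis is exactly `tokens`."""
    scores = np.zeros((len(tokens), vocab_size))
    scores[np.arange(len(tokens)), tokens] = scale
    return scores

def invert(trajectory, cond, vocab_size):
    """Record one residual per step; trajectory[t] holds x_t for t = 0..T."""
    residuals = {}
    for t in range(len(trajectory) - 1, 0, -1):
        logits = toy_logits(trajectory[t], t, cond, vocab_size)
        # Assumption: the residual is whatever must be added to the logits
        # so that the argmax lands on the recorded x_{t-1}.
        residuals[t] = one_hot_scores(trajectory[t - 1], vocab_size) - logits
    return residuals

def replay(x_T, residuals, cond, vocab_size, strength=1.0):
    """Run the reverse process, re-adding the recorded residuals at each step."""
    x = x_T
    for t in range(len(residuals), 0, -1):
        logits = toy_logits(x, t, cond, vocab_size)
        x = np.argmax(logits + strength * residuals[t], axis=-1)
    return x

# Build a toy forward trajectory by randomly resampling tokens at each step.
rng = np.random.default_rng(0)
vocab_size, T = 8, 4
x0 = rng.integers(0, vocab_size, size=6)
trajectory = [x0]
for t in range(1, T + 1):
    prev = trajectory[-1]
    flip = rng.random(prev.shape) < 0.3
    trajectory.append(np.where(flip, rng.integers(0, vocab_size, size=prev.shape), prev))

residuals = invert(trajectory, cond=0, vocab_size=vocab_size)
# Same condition, full residuals: the recorded trajectory is retraced back to x0.
reconstruction = replay(trajectory[-1], residuals, cond=0, vocab_size=vocab_size)
# Different condition, weaker residuals: the model's own prediction has more say.
edited = replay(trajectory[-1], residuals, cond=1, vocab_size=vocab_size, strength=0.1)
```

In this sketch, replaying with the original condition and `strength=1.0` retraces the recorded trajectory step by step, so the input is recovered. Changing the condition or lowering `strength` shifts weight toward the model's own prediction, which is one simple way to picture controlling how much of the original information is injected.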