Move Anything with Layered Scene Diffusion

Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul
2024-04-11
Abstract: Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to freely rearrange the layout of objects in an image while keeping the image quality of diffusion models, that is, how to make diffusion-generated scenes controllable and editable. It proposes SceneDiffusion, a method that uses a pre-trained text-to-image (T2I) diffusion model to generate scenes whose objects can be moved, resized, and cloned while the image content stays consistent and high in quality. Earlier controllable-scene methods rely on learning spatially disentangled latent codes, which does not carry over to diffusion models because of their fixed forward process. SceneDiffusion instead optimizes a layered scene representation during sampling, so users can apply fine-grained spatial edits to objects without any additional training.
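
To make the idea of a layered scene concrete, the following toy sketch shows how a scene stored as layers could be rendered at different layouts and how moving an object reduces to changing one layer's offset. It is not the authors' implementation: the dictionary fields (`content`, `mask`, `offset`), the `render` and `move` helpers, and the use of small 2D arrays in place of diffusion latents are all assumptions made for illustration, and the joint denoising step described in the abstract is omitted.

```python
import numpy as np

def render(layers, shape):
    """Composite layers back to front onto a blank canvas.

    Each layer is a dict with hypothetical fields:
      content: 2D array of values (a stand-in for latent features)
      mask:    2D boolean array marking where the layer is visible
      offset:  (dy, dx) shift that places the layer in the layout
    """
    canvas = np.zeros(shape)
    for layer in layers:
        # Shift content and mask together; np.roll wraps around the edges,
        # which is acceptable for this toy example only.
        content = np.roll(layer["content"], layer["offset"], axis=(0, 1))
        mask = np.roll(layer["mask"], layer["offset"], axis=(0, 1))
        # Later (front) layers overwrite earlier (back) layers where visible.
        canvas = np.where(mask, content, canvas)
    return canvas

def move(layer, offset):
    """Return a copy of the layer placed at a new offset."""
    return {**layer, "offset": offset}

shape = (4, 6)

# Background layer: visible everywhere, filled with zeros.
background = {
    "content": np.zeros(shape),
    "mask": np.ones(shape, dtype=bool),
    "offset": (0, 0),
}

# Object layer: a 2x2 block of ones in the top-left corner.
obj_mask = np.zeros(shape, dtype=bool)
obj_mask[0:2, 0:2] = True
obj = {"content": np.ones(shape), "mask": obj_mask, "offset": (0, 0)}

before = render([background, obj], shape)
after = render([background, move(obj, (1, 2))], shape)
```

Here `before` and `after` are two renderings of the same scene at different layouts: the object's content is unchanged and only its position differs. Cloning would correspond to appending a second copy of the object layer with another offset.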