Generative Powers of Ten

Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski
2024-05-22
Abstract: We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Graphics
What problem does this paper attempt to address?
The paper addresses the problem of generating a consistent, high-quality sequence of images across zoom levels, in particular extreme semantic zooms from a wide-angle view down to a macro close-up. Traditional approaches such as super-resolution and outpainting are limited in this multi-scale setting: they usually rely on the content of the original image to determine the details at subsequent zoom levels, so under extreme zoom they may fail to generate new structures or content.

To address this, the paper proposes a joint multi-scale diffusion sampling method. Using a pre-trained text-to-image diffusion model, with a different text prompt guiding each scale, it generates content that is consistent across multiple image scales. This makes it possible to render a continuous zoom video from a wide-angle landscape view to a macro close-up. The method aims to produce a plausible image at each scale while keeping content consistent between scales, which is the main difficulty existing methods face in cross-scale generation.

The main contributions of the paper include:

1. **Multi-scale joint sampling**: A new joint multi-scale diffusion sampling algorithm. Parallel diffusion sampling processes run at different zoom levels and are coordinated through an iterative frequency-band consolidation step, which keeps content consistent between scales.
2. **Zoom stack representation**: A new zoom stack representation from which an image can be rendered at any given zoom level, with content that stays consistent across zoom levels.
3. **Multi-resolution fusion**: A multi-resolution blending technique that merges the observations from different zoom levels into a single consistent zoom stack while avoiding blurring and aliasing.
4. **Photo-based zooming**: In addition to generating the entire zoom stack from scratch, a method for generating a zoom sequence from an existing photo. A loss function is optimized so that the generated image matches the input photo at the corresponding zoom level.

In qualitative comparisons with super-resolution and outpainting techniques, the paper shows that the method is most effective at generating consistent multi-scale content, especially where extreme zoom requires new structures and content.
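
The zoom stack idea described above can be illustrated with a small sketch. This is not the paper's implementation; it only shows one plausible way a stack of equally sized layers could be rendered at a chosen zoom level. The assumptions here are ours: layers are square grayscale images stored as nested lists, each layer covers the central 1/p of the layer before it (with a hypothetical zoom factor `p = 2`), and the helper names `downscale` and `render` are invented for illustration.

```python
"""Toy zoom-stack renderer: an illustrative sketch, not the paper's code."""

Image = list[list[float]]  # square grayscale image stored as rows of floats


def downscale(img: Image, factor: int) -> Image:
    """Box-filter downscale by an integer factor (averaging limits aliasing)."""
    size = len(img) // factor
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [
                img[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def render(stack: list[Image], level: int, p: int = 2) -> Image:
    """Render the view at `level` from a stack of same-sized square layers.

    Assumption: layer j shows the central 1 / p**(j - level) region of layer
    `level`, so each finer layer is shrunk and pasted into the centre.
    """
    size = len(stack[level])
    out = [row[:] for row in stack[level]]  # copy so the stack is not mutated
    for j in range(level + 1, len(stack)):
        small = downscale(stack[j], p ** (j - level))
        offset = (size - len(small)) // 2
        for r, row in enumerate(small):
            out[offset + r][offset:offset + len(row)] = row
    return out


if __name__ == "__main__":
    # Three 8x8 layers filled with 0.0, 1.0 and 2.0 so the nesting is visible.
    stack = [[[float(v)] * 8 for _ in range(8)] for v in range(3)]
    for row in render(stack, 0):
        print(row)
```

Under these assumptions, the consistency property can be checked directly: the central crop of the view rendered at one level equals the downscaled view rendered at the next level. In the paper, the layers themselves come from the joint diffusion sampling process, which this sketch does not attempt to model.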