Abstract: We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.

What problem does this paper attempt to address?

The paper addresses the problem of generating a consistent, high-quality sequence of images across zoom levels, in particular extreme semantic zooms that run from a wide-angle view down to a macro close-up. Traditional methods such as super-resolution and outpainting are limited on this kind of multi-scale task because they usually rely on the content of the original image to determine the details of subsequent zoom levels, so under extreme zoom they may fail to generate new structures or content.

Specifically, the paper proposes a joint multi-scale diffusion sampling method. Using a pre-trained text-to-image diffusion model, it generates content that is consistent across multiple image scales, which makes it possible to produce a continuously zooming video from a wide-angle landscape view to a macro close-up shot. The method produces a plausible image at each scale while keeping content consistent between scales, which is where existing methods struggle.

The main contributions of the paper include:

1. **Multi-scale joint sampling**: A new joint multi-scale diffusion sampling algorithm is proposed. Parallel diffusion sampling processes run at different zoom levels and are coordinated through an iterative frequency-band consolidation step so that content stays consistent between scales.
2. **Zoom stack representation**: A new zoom stack representation is introduced that can render an image at any given zoom level while keeping content consistent between zoom levels.
3. **Multi-resolution blending**: A multi-resolution blending technique is developed to merge the observations at different zoom levels into one consistent zoom stack, avoiding blurring and aliasing.
4. **Photo-based zooming**: In addition to generating the entire zoom stack from scratch, the paper proposes generating a zoom sequence from an existing photo. A loss function is optimized so that the generated image matches the input photo at the corresponding zoom level.

With these components, the paper reports, based on qualitative comparisons with super-resolution and outpainting baselines, that its method is the most effective at generating consistent multi-scale content, especially when extreme zoom requires new structures and content.
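The zoom stack idea in contribution 2 can be illustrated with a small sketch. It is not the paper's implementation: it assumes each level is a square image of the same size, that level `j` covers the central `1/p**j` portion of level 0's field of view, and that rendering a level means shrinking every deeper level and pasting it over the centre. The helper names `downsample` and `render_zoom_level`, the box filter, and the default zoom factor `p=2` are all hypothetical choices made for illustration.

```python
def downsample(img, factor):
    """Box-filter downsample a square image (list of rows) by an integer factor."""
    n = len(img) // factor
    out = []
    for r in range(n):
        row = []
        for c in range(n):
            block = [img[r * factor + dr][c * factor + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def render_zoom_level(stack, i, p=2):
    """Render level i of a zoom stack (hypothetical helper, assumed layout).

    stack[j] is assumed to cover the central 1/p**j portion of stack[0],
    so deeper levels carry finer detail for the middle of the view.
    """
    size = len(stack[i])
    out = [row[:] for row in stack[i]]      # copy so the stack is not modified
    for j in range(i + 1, len(stack)):
        factor = p ** (j - i)
        if factor > size:                   # deeper levels would be sub-pixel
            break
        small = downsample(stack[j], factor)
        k = len(small)
        off = (size - k) // 2               # top-left corner of the central patch
        for r in range(k):
            out[off + r][off:off + k] = small[r]
    return out


# Three 8x8 levels filled with 0.0, 1.0 and 2.0 to make the nesting visible.
stack = [[[float(level)] * 8 for _ in range(8)] for level in range(3)]
view0 = render_zoom_level(stack, 0)
view1 = render_zoom_level(stack, 1)
```

Because every rendered view is derived from the same stack, the centre of `view0` and the whole of `view1` come from the same underlying pixels, which is the sense in which a shared representation keeps zoom levels consistent in this toy setting.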

Generative Powers of Ten
