Abstract: Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between the data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.


### What problems does this paper attempt to solve?

This paper addresses two main problems that current text-to-image generation models (such as Stable Diffusion 3 and FLUX) face when asked to generate images with specific camera settings:

1. **Inability to accurately interpret camera settings**: Existing generation models cannot interpret specific camera settings (such as focal length, shutter speed, and aperture) or generate images consistent with them. For example, when a user requests the same scene at different focal lengths, the generated images change not only in field of view (FoV) but also in the objects in the scene (such as rocks and trees), producing inconsistent results.
2. **Difficulty in maintaining scene consistency**: When camera settings are adjusted, the generation model often alters the scene content: elements such as buildings and people change from image to image instead of only the camera effect changing. The generated images therefore lose realism and consistency.

To solve these problems, the paper introduces **Generative Photography**, a new paradigm that aims to generate realistic images by controlling the intrinsic settings of the camera while maintaining scene consistency. Specifically, the paper proposes two key techniques:

- **Dimensionality Lifting**: Lift the multi-camera-setting image generation problem to a video generation problem, thereby separating the invariant scene description from the camera settings to ensure scene consistency.
- **Contrastive Camera Learning**: By constructing a contrastive dataset and designing a contrastive camera encoder, the model can better understand and generate different camera effects.
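The focal-length example in item 1 describes a purely geometric effect: under a thin-lens, rectilinear model focused at infinity, the angular field of view depends only on the focal length and the sensor size, so the scene content itself should stay fixed. The sketch below is not taken from the paper. It uses the standard relation FoV = 2 * atan(d / (2f)) and assumes a full-frame sensor width of 36 mm, which the summary does not specify; the name `horizontal_fov_deg` is hypothetical.

```python
import math

# Assumption: full-frame sensor width in millimeters (not stated in the summary).
FULL_FRAME_WIDTH_MM = 36.0


def horizontal_fov_deg(focal_length_mm, sensor_width_mm=FULL_FRAME_WIDTH_MM):
    """Return the horizontal field of view in degrees for a rectilinear lens.

    Uses FoV = 2 * atan(sensor_width / (2 * focal_length)), which holds for
    a thin lens focused at infinity.
    """
    if focal_length_mm <= 0 or sensor_width_mm <= 0:
        raise ValueError("focal length and sensor width must be positive")
    half_angle_rad = math.atan(sensor_width_mm / (2.0 * focal_length_mm))
    return math.degrees(2.0 * half_angle_rad)


# The two focal lengths mentioned in the abstract.
fov_24 = horizontal_fov_deg(24.0)
fov_70 = horizontal_fov_deg(70.0)

print(f"24 mm: {fov_24:.2f} degrees")
print(f"70 mm: {fov_70:.2f} degrees")
```

Under these assumptions a 24 mm lens covers roughly 74 degrees horizontally and a 70 mm lens roughly 29 degrees. A scene-consistent generator would be expected to show the same rocks and trees in both images, with the 70 mm image corresponding to a narrower crop of the same scene.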
Through these methods, the paper aims to achieve precise control of various camera settings (such as shutter speed, aperture, focal length, and color temperature) while maintaining the scene consistency of the generated images, thereby producing more realistic results.

### Formula representation

When describing camera settings, the paper involves several physical parameters, for example:

- **Focal Length**: Represented by \( f \), in millimeters (mm).
- **Shutter Speed**: Represented by \( t \), in seconds (s).
- **Aperture**: Represented by \( A \), usually given in the form \( f/\text{number} \).
- **Color Temperature**: Represented by \( T \), in kelvins (K).

For example, in the contrastive dataset, the sampling range for color temperature can be represented as:

\[ T \in [2000\,\text{K}, 10000\,\text{K}] \]

The normalization of shutter speed can be represented as:

\[ t' = \frac{t - t_{\min}}{t_{\max} - t_{\min}} \]

where \( t_{\min} \) and \( t_{\max} \) are the minimum and maximum values of shutter speed, respectively. Through these formulas and methods, the paper aims to control camera settings precisely and to keep generated images scene-consistent across settings.
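The min-max normalization above can be sketched in a few lines of Python. This is an illustration only, not the paper's code: the shutter-speed bounds `T_MIN_S` and `T_MAX_S` are hypothetical values chosen for the example, and applying the same normalization to color temperature is an assumption rather than something the summary states. Only the 2000 K to 10000 K range comes from the text.

```python
def min_max_normalize(value, lower, upper):
    """Map value from [lower, upper] to [0, 1] as (value - lower) / (upper - lower)."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    if not lower <= value <= upper:
        raise ValueError("value lies outside [lower, upper]")
    return (value - lower) / (upper - lower)


# Hypothetical shutter-speed bounds in seconds, chosen only for illustration.
T_MIN_S = 0.1
T_MAX_S = 1.0

# Color-temperature bounds in kelvins, taken from the range quoted above.
CCT_MIN_K = 2000.0
CCT_MAX_K = 10000.0

# Normalized shutter speed t' for t = 0.55 s under the hypothetical bounds.
t_norm = min_max_normalize(0.55, T_MIN_S, T_MAX_S)

# Assumption: the same mapping applied to a 6000 K color temperature.
cct_norm = min_max_normalize(6000.0, CCT_MIN_K, CCT_MAX_K)

print(t_norm, cct_norm)
```

Rejecting out-of-range inputs, rather than clamping them, is a design choice of this sketch: it surfaces a setting outside the sampled range instead of silently mapping it to 0 or 1.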

Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
