Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis

Yu Yuan, Xijun Wang, Yichen Sheng, Prateek Chennuri, Xingguang Zhang, Stanley Chan
2024-12-03
Abstract: Image generation today can produce somewhat realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting such as creating different fields of view using a 24mm lens versus a 70mm lens, the generator will not be able to interpret and generate scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue of bridging the gap between the data-driven models and the physical world. In this paper, we introduce the concept of Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions for different camera settings. Experimental results show that our method produces significantly more scene-consistent photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses two main problems that current text-to-image generation models (such as Stable Diffusion 3 and FLUX) face when asked to generate images with specific camera settings:

1. **Inability to accurately interpret camera settings**: Existing generation models cannot understand camera settings (such as focal length, shutter speed, and aperture) or generate images consistent with them. For example, when a user requests the same scene at different focal lengths, the generated images change not only the field of view (FoV) but also the objects in the scene (such as rocks and trees), so the results are inconsistent.
2. **Difficulty in maintaining scene consistency**: When camera settings are adjusted, the generation model often changes the scene content as well. Elements such as buildings and people differ from image to image, instead of the images differing only in the effect of the camera setting. As a result, the generated images lose realism and consistency.

To solve these problems, the paper introduces the concept of **Generative Photography**, a new paradigm that aims to generate realistic images by controlling the intrinsic settings of the camera while maintaining scene consistency. Specifically, the paper proposes two key techniques:

- **Dimensionality Lifting**: Lift the multi-camera-setting image generation problem to a video generation problem, thereby separating the invariant scene description from the camera settings to ensure scene consistency.
- **Contrastive Camera Learning**: By constructing a contrastive dataset and designing a contrastive camera encoder, the model can better understand and generate different camera effects.
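To make the focal-length example in problem 1 concrete, the following is a minimal sketch of how focal length determines field of view for an ideal rectilinear lens focused at infinity. It is not taken from the paper: the function name `horizontal_fov_deg` and the 36 mm full-frame sensor width are assumptions chosen for illustration.

```python
import math


def horizontal_fov_deg(focal_length_mm: float, sensor_width_mm: float = 36.0) -> float:
    """Horizontal field of view, in degrees, of an ideal pinhole-style camera.

    Assumes a rectilinear lens focused at infinity. The default sensor width
    of 36 mm (full frame) is an illustrative assumption, not a value from the paper.
    """
    if focal_length_mm <= 0 or sensor_width_mm <= 0:
        raise ValueError("focal length and sensor width must be positive")
    # FoV = 2 * atan(w / (2 f)), where w is the sensor width and f the focal length.
    return math.degrees(2.0 * math.atan(sensor_width_mm / (2.0 * focal_length_mm)))


# The two focal lengths mentioned in the abstract.
for f_mm in (24.0, 70.0):
    print(f"{f_mm:.0f} mm -> {horizontal_fov_deg(f_mm):.1f} deg horizontal FoV")
```

Under these assumptions, a 24 mm lens covers roughly 74 degrees horizontally and a 70 mm lens roughly 29 degrees. A scene-consistent generator should therefore render the 70 mm image as a narrower view of the same scene rather than as a different scene.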
Through these methods, the paper aims to achieve precise control of various camera settings (such as shutter speed, aperture, focal length, and color temperature) while maintaining the scene consistency of the generated images, thereby generating more realistic images.

### Formula representation

When describing camera settings, the paper involves some physical parameters, for example:

- **Focal Length**: Represented by \( f \), in millimeters (mm).
- **Shutter Speed**: Represented by \( t \), in seconds (s).
- **Aperture**: Represented by \( A \), usually given in the form \( f/\text{number} \).
- **Color Temperature**: Represented by \( T \), in Kelvin (K).

For example, in the contrastive dataset, the sampling range for color temperature can be represented as:

\[ T \in [2000\,\text{K},\ 10000\,\text{K}] \]

The normalization process for shutter speed can be represented as:

\[ t' = \frac{t - t_{\min}}{t_{\max} - t_{\min}} \]

where \( t_{\min} \) and \( t_{\max} \) are the minimum and maximum values of shutter speed, respectively.

Through these formulas and methods, the paper achieves precise control of camera settings and ensures that the generated images maintain scene consistency under different settings.
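The min-max normalization above can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the helper name `min_max_normalize` is hypothetical, the shutter-speed bounds are placeholder values (the text above does not state them), and applying the same formula to color temperature is an assumption made only to reuse the stated 2000 K to 10000 K range.

```python
def min_max_normalize(value: float, lower: float, upper: float) -> float:
    """Map `value` from [lower, upper] onto [0, 1] via (value - lower) / (upper - lower)."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    if not lower <= value <= upper:
        raise ValueError("value lies outside the sampling range")
    return (value - lower) / (upper - lower)


# Color temperature range quoted in the text: T in [2000 K, 10000 K].
T_MIN_K, T_MAX_K = 2000.0, 10000.0

# Hypothetical shutter-speed bounds in seconds (placeholders, not from the paper).
SHUTTER_MIN_S, SHUTTER_MAX_S = 0.1, 1.0

print(min_max_normalize(6000.0, T_MIN_K, T_MAX_K))           # midpoint of the range
print(min_max_normalize(0.55, SHUTTER_MIN_S, SHUTTER_MAX_S))  # midpoint of the placeholder range
```

Rescaling each setting to a common [0, 1] interval puts parameters with very different units and magnitudes on a comparable scale before they are fed to a model.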