Abstract: This work aims to address the multi-view perspective RGB generation from text prompts given Bird-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, lack the ability to handle detailed text prompts, or are incapable of generalizing to unseen view points, MVPbev simultaneously generates cross-view consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test-time. Specifically, MVPbev firstly projects given BEV semantics to perspective view with camera parameters, empowering the model to generalize to unseen view points. Then we introduce a multi-view attention module where special initialization and de-noising processes are introduced to explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Last but not least, MVPbev further allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Our extensive experiments on NuScenes demonstrate that our method is capable of generating high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing the state-of-the-art methods under various evaluation metrics. We further demonstrate the advances of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis. Our code, data, and model can be found at https://github.com/kkaiwwana/MVPbev.
What problem does this paper attempt to address?
The paper aims to generate multi-view perspective RGB images from Bird-Eye-View (BEV) semantics and text prompts, while remaining controllable and able to generalize at test time. It identifies three main problems in existing methods:
1. **Insufficient controllability at test time**: Existing frameworks depend heavily on their training samples, so they cannot reliably control camera pose or offer additional object-instance control at test time.
2. **Poor cross-view consistency**: Existing methods do not adequately enforce cross-view consistency, which leads to visually inconsistent content in the overlapping Field of View (FOV).
3. **Lack of detailed human analysis**: No thorough human analysis has been carried out for this image generation task, which makes comparisons between methods difficult to interpret.
To overcome these problems, the paper proposes MVPbev, a two-stage method that generates controllable multi-view perspective RGB images from given BEV semantics and text prompts. It improves test-time controllability and generalization by explicitly enforcing cross-view consistency.
### Specific methods
#### 1. Semantically consistent view projection
The first stage projects BEV semantics onto multiple perspective views to generate a series of perspective semantic maps $\{S_m\}_{m = 1}^M$. This stage uses a geometric transformation based on the camera parameters to keep the semantics consistent between the BEV and the perspective views and to reduce errors that would otherwise accumulate in the generation step.
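As an illustration of this kind of projection, the following minimal sketch casts a ray through every pixel, intersects it with a flat ground plane, and looks up the BEV class at the hit point. It is not the paper's implementation: the flat-ground assumption, the axis conventions, the BEV grid layout (`x_min`, `y_min`, `res`), and the function name `project_bev_to_view` are all assumptions made for this example.

```python
import numpy as np

def project_bev_to_view(bev, K, R_wc, cam_center, out_hw, x_min, y_min, res, ignore=255):
    """Render a perspective semantic map from a BEV label grid (hypothetical helper).

    Assumptions: the ground is the plane z = 0 in the ego frame (x forward,
    y left, z up); bev[row, col] covers x = x_min + row * res and
    y = y_min + col * res; R_wc rotates camera-frame vectors into the ego frame.
    """
    h, w = out_hw
    # Pixel centers in homogeneous coordinates, shape (h, w, 3).
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project pixels to ray directions, then rotate into the ego frame.
    rays_world = (pix @ np.linalg.inv(K).T) @ R_wc.T
    dz = rays_world[..., 2]
    hits = dz < -1e-9  # only downward-pointing rays reach the ground
    s = np.where(hits, -cam_center[2] / np.where(hits, dz, -1.0), 0.0)
    ground = cam_center + s[..., None] * rays_world
    rows = np.floor((ground[..., 0] - x_min) / res).astype(int)
    cols = np.floor((ground[..., 1] - y_min) / res).astype(int)
    valid = hits & (rows >= 0) & (rows < bev.shape[0]) & (cols >= 0) & (cols < bev.shape[1])
    out = np.full((h, w), ignore, dtype=bev.dtype)
    out[valid] = bev[rows[valid], cols[valid]]
    return out

# Toy usage: a forward-facing camera 1.5 m above the ground.
bev = np.ones((200, 200), dtype=np.uint8)  # class 1 everywhere
bev[:, 100:] = 2                           # class 2 where y > 0 (left of the ego)
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
# Camera axes (x right, y down, z forward) expressed in the ego frame.
R_wc = np.array([[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
sem = project_bev_to_view(bev, K, R_wc, np.array([0.0, 0.0, 1.5]),
                          (48, 64), x_min=0.0, y_min=-50.0, res=0.5)
```

Because the mapping depends only on the camera parameters, a new camera pose needs no retraining, which matches the summary's claim that this stage helps with unseen viewpoints.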
#### 2. View-consistent image generation
The second stage takes the perspective semantic maps and text prompts as input and generates multi-view RGB images. To ensure cross-view consistency, a Multi-View Attention Module is introduced. It estimates the homography matrix of the overlapping area and implicitly enforces style consistency across views. In addition, a dedicated initialization and denoising design explicitly enforces the consistency of visual cues in the overlapping FOV.
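To make the role of a cross-view homography concrete, the sketch below builds the standard plane-induced homography between two calibrated cameras and uses it to map pixels from one view to the other. The summary does not say how MVPbev estimates its homography, so the choice of a known plane (for example the ground) and the names `plane_homography`, `warp_points`, and `project` are assumptions for illustration only.

```python
import numpy as np

def plane_homography(K1, K2, R, t, n, d):
    """Homography taking view-1 pixels to view-2 pixels for points on a plane.

    Assumptions: X2 = R @ X1 + t relates the two camera frames, and the plane
    satisfies n . X1 = d in the view-1 camera frame.
    """
    return K2 @ (R + np.outer(t, n) / d) @ np.linalg.inv(K1)

def warp_points(H, uv):
    """Apply a homography to an (N, 2) array of pixel coordinates."""
    uv1 = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)
    out = uv1 @ H.T
    return out[:, :2] / out[:, 2:3]

def project(K, X):
    """Pinhole projection of a 3D point given in the camera frame."""
    x = K @ X
    return x[:2] / x[2]

# Toy check: a ground point (camera y axis points down, ground 1.5 m below).
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.2, 0.0, 0.1])
n, d = np.array([0.0, 1.0, 0.0]), 1.5
X1 = np.array([0.5, 1.5, 6.0])  # lies on the plane: n . X1 = 1.5

H = plane_homography(K, K, R, t, n, d)
p1 = project(K, X1)
p2 = project(K, R @ X1 + t)
p2_from_H = warp_points(H, p1[None])[0]
```

A homography is exact only for points on the chosen plane (or for cameras related by a pure rotation), so it gives local correspondences in the overlap rather than a full 3D alignment.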
### Model training and inference
#### Training
The model is trained with a multi-view Latent Diffusion Model (LDM) loss. Specifically, each input image $I_m$ is mapped to the latent space as $l_m=\epsilon(I_m)$ and then processed by the denoising network $\delta_\theta$ together with the conditional encoder $\tau_\theta$. The training objective is to make the noise predicted by the network as close as possible to the noise that was actually added.
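For orientation, a multi-view noise-prediction objective of this kind is usually written in the following form. This is a sketch in the summary's notation rather than the paper's exact equation: the noise symbol $n_m$, the timestep $t$, the noised latent $l_m^{t}$, and the per-view condition $c_m$ (text prompt plus perspective semantic map) are assumptions introduced here.

$$
\mathcal{L} = \mathbb{E}_{\{l_m\},\, \{n_m \sim \mathcal{N}(0, I)\},\, \{c_m\},\, t}\left[ \sum_{m=1}^{M} \left\| n_m - \delta_\theta\!\left(l_m^{t},\, t,\, \tau_\theta(c_m)\right) \right\|_2^2 \right]
$$

Here $l_m^{t}$ denotes the latent $l_m$ after noise $n_m$ has been added at step $t$, and the sum runs over the $M$ views generated together.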
#### Inference
At inference time, the values in the overlapping areas are re-allocated across views so that the generated multi-view images are visually consistent. This step takes place before the decoder $D$ produces the final RGB images.
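One plausible reading of this re-allocation step is sketched below: values in the overlap of one view are overwritten with values sampled from the neighbouring view through a homography, before decoding. The summary does not specify the actual rule, so the nearest-neighbour sampling, the direction of the copy, and the function name `share_overlap` are assumptions.

```python
import numpy as np

def share_overlap(lat_a, lat_b, H_b_to_a):
    """Copy view-A values into the part of view B that overlaps view A.

    lat_a, lat_b: arrays of shape (C, H, W). H_b_to_a maps view-B pixel
    coordinates (u, v) to view-A coordinates. Returns the updated copy of
    lat_b and a boolean (H, W) mask marking the overlap.
    """
    c, h, w = lat_b.shape
    v, u = np.mgrid[0:h, 0:w]
    pts = np.stack([u.ravel(), v.ravel(), np.ones(h * w)], axis=0)  # (3, H*W)
    warped = H_b_to_a @ pts
    zw = warped[2]
    ok = zw > 1e-8  # discard points that map behind or onto the image plane at infinity
    ua = np.full(h * w, -1)
    va = np.full(h * w, -1)
    ua[ok] = np.rint(warped[0, ok] / zw[ok]).astype(int)
    va[ok] = np.rint(warped[1, ok] / zw[ok]).astype(int)
    inside = ok & (ua >= 0) & (ua < lat_a.shape[2]) & (va >= 0) & (va < lat_a.shape[1])
    out = lat_b.copy()
    out_flat = out.reshape(c, -1)  # view onto `out`, so writes update it in place
    out_flat[:, inside] = lat_a[:, va[inside], ua[inside]]
    return out, inside.reshape(h, w)

# Toy usage: view B is view A shifted two pixels to the right.
lat_a = np.arange(24, dtype=float).reshape(1, 4, 6)
lat_b = np.zeros((1, 4, 6))
H_shift = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
merged, mask = share_overlap(lat_a, lat_b, H_shift)
```

Averaging the two views inside the overlap instead of copying would be an equally reasonable variant under the same assumptions.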
### Experiment
#### Dataset
Experiments are conducted on the NuScenes dataset, whose six cameras provide full 360-degree coverage. It contains 1,000 street-view scenes, each lasting 20 seconds and captured at 12 Hz. The dataset also includes multi-modal data, such as a global map layer and 3D object bounding boxes on 40,000 keyframes.
#### Evaluation metrics
- **Image quality**: It is evaluated using Fréchet Inception Distance (FID), Inception Score (IS) and CLIP Score (CS).
- **Visual consistency**: It is measured by calculating the Peak Signal-to-Noise Ratio (PSNR) of the overlapping area.
- **Semantic consistency**: The Intersection-over-Union (IoU) score is used to measure the pixel-level semantic consistency between the generated image and the real image.
- **Object-level controllability**: It is measured by the average color difference Delta-E in the CIELAB color space and its standard deviation.
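The three pixel-level metrics in this list can be sketched with their textbook definitions, as below. The function names are hypothetical, inputs are assumed to be already aligned (for example, overlap pixels warped into a common view), and the CIE76 form of Delta-E is an assumption, since the summary does not say which Delta-E variant the paper uses.

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two aligned image regions."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def iou(pred, target, cls):
    """Intersection-over-Union of one class between two label maps."""
    p, t = pred == cls, target == cls
    union = np.logical_or(p, t).sum()
    if union == 0:
        return float("nan")  # class absent from both maps
    return np.logical_and(p, t).sum() / union

def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between CIELAB triplets."""
    diff = np.asarray(lab1, dtype=float) - np.asarray(lab2, dtype=float)
    return np.linalg.norm(diff, axis=-1)

# Toy usage.
psnr_val = psnr(np.zeros((2, 2), dtype=np.uint8), np.full((2, 2), 255, dtype=np.uint8))
iou_val = iou(np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0]), cls=1)
de_val = delta_e_cie76([50.0, 0.0, 0.0], [50.0, 3.0, 4.0])
```

Higher PSNR and IoU indicate better consistency, while a smaller Delta-E means the generated object color is closer to the requested one.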
#### Human analysis
In addition to the quantitative metrics, a human analysis is carried out in which raters judge which method's generated images are more visually realistic and consistent.
### Results
The paper reports that MVPbev outperforms existing methods on various evaluation metrics, especially in cross-view consistency and test-time controllability. The human analysis further supports the quality of the images generated by MVPbev.