Abstract: While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing the training and rendering complexity, applying inverse graphics in the latent space enables a valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D-geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that Latent NeRFs trained with IG-AE present an improved quality compared to a standard autoencoder, all while exhibiting training and rendering accelerations with respect to NeRFs trained in the image space. Our project page can be found at https://ig-ae.github.io.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the challenges of applying inverse graphics in 2D latent spaces. The authors point out that although pre-trained image autoencoders are increasingly used in computer vision, few studies have applied inverse graphics in their 2D latent spaces. The main reason is that these latent spaces lack an underlying 3D geometric structure, so inverse graphics cannot be applied to them directly.

#### Main problems:

1. **Incompatibility between latent space and 3D geometric structure**: Traditional image latent spaces lack 3D geometric consistency, which causes problems when performing scene learning (such as NeRF training) in these spaces.
2. **Poor quality of NeRF models trained in latent spaces**: Because the latent space is not 3D-consistent, latent NeRFs trained with a standard autoencoder (AE) show artifacts once their renderings are decoded, which degrades the final image quality.
3. **Training and rendering efficiency**: Existing NeRF models have high training and rendering complexity in the image space, so more efficient training methods are needed.

#### Solutions:

To solve the above problems, the authors propose a new model, the **Inverse Graphics Autoencoder (IG-AE)**. IG-AE introduces 3D geometric consistency by aligning the latent space with jointly trained latent 3D scenes, which resolves the incompatibility between the latent space and 3D tasks. Specific contributions include:

- **Introducing a 3D-aware latent space**: By introducing 3D geometric structure, the latent space becomes 3D-consistent and suitable for 3D tasks.
- **Proposing the Inverse Graphics Autoencoder (IG-AE)**: This model maps images to a 3D-aware latent space while preserving autoencoding performance.
- **Standardized latent NeRF training method**: A general latent NeRF training method is proposed, which includes two stages: Latent Supervision and RGB Alignment.
- **Open-source extension of the Nerfstudio framework**: An open-source extension has been developed to support training the NeRF models supported by Nerfstudio in the latent space, simplifying future research work.

Through these improvements, the authors show that latent NeRFs trained with IG-AE reach a higher quality than those trained with a standard autoencoder, while training and rendering faster than NeRFs trained in the image space.

### Formula summary

The main formulas involved in the paper are as follows:

1. **Latent NeRF rendering**:
   \[ \tilde{z}_p = F_\theta(p), \quad \tilde{x}_p = D_\psi(\tilde{z}_p) \]
   where \(F_\theta\) is the latent NeRF, \(D_\psi\) is the decoder, \(\tilde{z}_p\) is the rendered latent image with shape \((h, w, c)\), \(\tilde{x}_p\) is the decoded RGB image with shape \((H, W, 3)\), and \(l = H/h = W/w > 1\) is the resolution scaling factor between RGB space and latent space.
2. **Latent supervision loss**:
   \[ L_{LS}(\theta)=\sum_{p\in P}L_{F_\theta}(\theta; z_p, \tilde{z}_p) \]
   where \(z_p\) is the latent encoding of the ground-truth RGB image, \(\tilde{z}_p\) is the rendered latent image, and \(P\) is the set of training camera poses.
3. **RGB alignment loss**:
   \[ L_{align}(\theta, \psi)=\sum_{p\in P}\|x_p - \tilde{x}_p\|^2_2 \]
   where \(x_p\) is the ground-truth RGB image and \(\tilde{x}_p = D_\psi(\tilde{z}_p)\) is the decoded latent NeRF rendering.
4. **3D regularization loss**
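
The rendering step and the two losses in items 1 to 3 of the formula summary can be illustrated with a small NumPy sketch. It is not the authors' implementation. `render_latent` is a hypothetical stand-in for \(F_\theta\) that simply looks up a per-pose latent image, `decode` is a hypothetical stand-in for \(D_\psi\) (nearest-neighbour upsampling followed by a linear map), the per-pose loss \(L_{F_\theta}\) is assumed to be a squared error, and all sizes are arbitrary toy values.

```python
import numpy as np

# Hypothetical toy sizes: latent images are (h, w, c), RGB images are (H, W, 3),
# with H = l * h and W = l * w for a scaling factor l > 1.
h, w, c, l = 4, 4, 4, 8
H, W = l * h, l * w

def render_latent(theta, p):
    """Toy stand-in for F_theta(p): look up a per-pose latent image of shape (h, w, c)."""
    return theta[p]

def decode(psi, z):
    """Toy stand-in for D_psi: nearest-neighbour upsampling by l, then a linear map c -> 3."""
    upsampled = z.repeat(l, axis=0).repeat(l, axis=1)  # shape (H, W, c)
    return upsampled @ psi                              # shape (H, W, 3)

def latent_supervision_loss(theta, latents_gt, poses):
    """L_LS: sum over poses of a per-pose loss (squared error is an assumption here)."""
    return sum(
        float(np.sum((latents_gt[p] - render_latent(theta, p)) ** 2)) for p in poses
    )

def rgb_alignment_loss(theta, psi, images_gt, poses):
    """L_align: sum over poses of the squared L2 distance between x_p and D_psi(F_theta(p))."""
    return sum(
        float(np.sum((images_gt[p] - decode(psi, render_latent(theta, p))) ** 2))
        for p in poses
    )

rng = np.random.default_rng(0)
poses = [0, 1, 2]                                        # toy pose identifiers
theta = {p: rng.normal(size=(h, w, c)) for p in poses}   # toy "scene parameters"
psi = rng.normal(size=(c, 3))                            # toy decoder weights
latents_gt = {p: theta[p] + 0.1 for p in poses}          # targets offset from the renders
images_gt = {p: decode(psi, theta[p]) for p in poses}    # targets matched exactly

print(latent_supervision_loss(theta, latents_gt, poses))
print(rgb_alignment_loss(theta, psi, images_gt, poses))
```

In this toy setup the latent targets are offset from the renders, so the latent supervision loss is positive, while the RGB targets are produced by the same decoder and the alignment loss vanishes. In an actual pipeline both losses would be minimized by gradient descent, the first over \(\theta\) and the second over \(\theta\) and \(\psi\).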

Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder

Scaled Inverse Graphics: Efficiently Learning Large Sets of 3D Scenes

Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations

HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields

NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors

AutoNeRF: Training Implicit Scene Representations with Autonomous Agents

GANeRF: Leveraging Discriminators to Optimize Neural Radiance Fields

ED-NeRF: Efficient Text-Guided Editing of 3D Scene with Latent Space NeRF

GeoNeRF: Generalizing NeRF with Geometry Priors

IE-NeRF: Inpainting Enhanced Neural Radiance Fields in the Wild

GenN2N: Generative NeRF2NeRF Translation

TriPlaneNet: An Encoder for EG3D Inversion

HyperNeRFGAN: Hypernetwork approach to 3D NeRF GAN

LatentEditor: Text Driven Local Editing of 3D Scenes

Gaussian Splatting Decoder for 3D-aware Generative Adversarial Networks

NeRF-In: Free-Form Inpainting for Pretrained NeRF With RGB-D Priors

ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural Rendering

NeRF-VAE: A Geometry Aware 3D Scene Generative Model

GL-NeRF: Gauss-Laguerre Quadrature Enables Training-Free NeRF Acceleration

FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations