Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder

Antoine Schnepf, Karim Kassab, Jean-Yves Franceschi, Laurent Caraffa, Flavian Vasile, Jeremie Mary, Andrew Comport, Valerie Gouet-Brunet
2024-10-30
Abstract: While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing the training and rendering complexity, applying inverse graphics in the latent space enables a valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D-geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that Latent NeRFs trained with IG-AE present an improved quality compared to a standard autoencoder, all while exhibiting training and rendering accelerations with respect to NeRFs trained in the image space. Our project page can be found at https://ig-ae.github.io.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper aims to address the challenges of applying inverse graphics in 2D latent spaces. Specifically, the authors point out that although pre-trained image autoencoders are increasingly widely used in computer vision, relatively few studies have applied inverse graphics in 2D latent spaces. The main reason is that these latent spaces lack the underlying 3D geometric structure, which makes it difficult to directly apply inverse graphics.

#### Main problems:

1. **Incompatibility between latent space and 3D geometric structure**: Traditional image latent spaces lack 3D geometric consistency, leading to problems when performing scene learning (such as NeRF model training) in these spaces.
2. **Poor training quality of NeRF models in latent spaces**: Due to the 3D inconsistency of the latent space, the latent NeRF models trained with standard autoencoders (AE) will have artifacts when decoding and rendering, affecting the final image quality.
3. **Training and rendering efficiency**: Existing NeRF models have high training and rendering complexity in the image space, and more efficient training methods need to be explored.

#### Solutions:

To solve the above problems, the authors propose a new model, the **Inverse Graphics Autoencoder (IG-AE)**. IG-AE introduces 3D geometric consistency by aligning the latent space with jointly trained 3D scenes, thus solving the incompatibility between the latent space and 3D tasks.

Specific contributions include:

- **Introducing 3D-aware latent space**: By introducing 3D geometric structure, the latent space has 3D consistency and is suitable for 3D tasks.
- **Proposing Inverse Graphics Autoencoder (IG-AE)**: This model can map images to 3D-aware latent spaces while maintaining the performance of autoencoders.
- **Standardized latent NeRF training method**: A general latent NeRF training method is proposed, which includes two stages: Latent Supervision and RGB Alignment.
- **Open-source extension of Nerfstudio framework**: An open-source extension has been developed to support training various NeRF models supported by Nerfstudio in the latent space, simplifying future research work.

Through these improvements, the authors show that latent NeRFs trained with IG-AE are of higher quality than those trained with a standard autoencoder, and that they train and render faster than NeRFs trained in the image space.

### Formula summary

The main formulas involved in the paper are as follows:

1. **Latent NeRF rendering**:
   \[ \tilde{z}_p = F_\theta(p), \quad \tilde{x}_p = D_\psi(\tilde{z}_p) \]
   where \(F_\theta\) is the latent NeRF, \(D_\psi\) is the decoder, \(p\) is a camera pose, \(\tilde{z}_p\) is the rendered latent image with shape \((h, w, c)\), \(\tilde{x}_p\) is the decoded RGB image with shape \((H, W, 3)\), and \(l = H/h = W/w > 1\) is the resolution scaling factor between the RGB space and the latent space.
2. **Latent supervision loss**:
   \[ L_{LS}(\theta)=\sum_{p\in P}L_{F_\theta}(\theta; z_p, \tilde{z}_p) \]
   where \(z_p\) is the latent encoding of the ground-truth RGB image, \(\tilde{z}_p\) is the rendered latent image, \(L_{F_\theta}\) is the loss of the NeRF method \(F_\theta\), and \(P\) is the set of training camera poses.
3. **RGB alignment loss**:
   \[ L_{align}(\theta, \psi)=\sum_{p\in P}\|x_p - \tilde{x}_p\|^2_2 \]
   where \(x_p\) is the ground-truth RGB image and \(\tilde{x}_p = D_\psi(\tilde{z}_p)\) is the decoded latent NeRF rendering.
4. **3D regularization loss**
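
As a rough illustration of items 1 to 3 above (latent rendering and decoding, Latent Supervision, RGB Alignment), the following pure-Python sketch evaluates both losses on toy data. It is not the authors' implementation: `decode` is a hypothetical stand-in for \(D_\psi\) (a linear projection followed by nearest-neighbour upsampling), the dictionaries of flat lists stand in for latent renders \(\tilde{z}_p\), encoded latents \(z_p\), and images \(x_p\), and a squared L2 distance is assumed in place of the method-specific loss \(L_{F_\theta}\).

```python
# Toy sketch of the latent supervision and RGB alignment losses.
# Every function here is a hypothetical stand-in, not the IG-AE code.

def squared_l2(a, b):
    """Squared L2 distance between two flat lists of floats."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def decode(z, h, w, c, l, weights):
    """Stand-in for D_psi: maps a flat (h, w, c) latent to a flat (l*h, l*w, 3) image.

    Each latent pixel is projected to RGB with a c x 3 matrix `weights`
    and repeated over an l x l block (nearest-neighbour upsampling).
    """
    H, W = l * h, l * w
    x = [0.0] * (H * W * 3)
    for i in range(H):
        for j in range(W):
            src = ((i // l) * w + (j // l)) * c  # offset of the source latent pixel
            for k in range(3):
                x[(i * W + j) * 3 + k] = sum(
                    z[src + m] * weights[m][k] for m in range(c)
                )
    return x

def latent_supervision_loss(latents, renders):
    """L_LS: sum over poses of a per-pose latent loss (squared L2 assumed)."""
    return sum(squared_l2(latents[p], renders[p]) for p in latents)

def rgb_alignment_loss(images, renders, decoder):
    """L_align: sum over poses of ||x_p - D_psi(z~_p)||_2^2."""
    return sum(squared_l2(images[p], decoder(renders[p])) for p in images)

# Toy example: one pose, a 2 x 2 x 4 latent, scaling factor l = 2.
h, w, c, l = 2, 2, 4, 2
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
dec = lambda z: decode(z, h, w, c, l, weights)

renders = {"pose0": [0.5] * (h * w * c)}   # rendered latent z~_p
latents = {"pose0": [0.5] * (h * w * c)}   # encoded latent z_p
images = {"pose0": dec(renders["pose0"])}  # RGB target x_p, shape (4, 4, 3)

print(latent_supervision_loss(latents, renders))
print(rgb_alignment_loss(images, renders, dec))
```

In this toy setup both losses vanish because the targets were built from the renders themselves. In an actual pipeline the first loss would drive updates to \(\theta\) only, while the second would update both \(\theta\) and \(\psi\), as the arguments of \(L_{LS}(\theta)\) and \(L_{align}(\theta, \psi)\) indicate.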