DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Minghao Chen, Iro Laina, Andrea Vedaldi
2024-07-22
Abstract: We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling. A major disadvantage of this approach is the slow convergence caused by aggregating inconsistent information across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the inefficiency and inconsistency of editing 3D objects and scenes from open-ended language instructions. Existing methods usually rely on a 2D image generator or editor to guide the 3D editing process. This avoids the need for 3D data, but it is inefficient because a costly 3D representation (such as a neural radiance field) must be updated iteratively, and the 2D guidance is not consistent across views, which slows convergence.

To solve these problems, the authors introduce the Direct Gaussian Editor (DGE), which works in two stages:

1. **Multi-view Consistent Editing**: First, they modify a high-quality 2D image editor (such as InstructPix2Pix), without retraining it, so that it produces multi-view consistent edits. The key to this step is to incorporate 3D geometric cues from the underlying scene into the editing process so that edits of the same content agree across viewpoints.
2. **Direct Optimization of the 3D Representation**: Given the multi-view consistent edited images, DGE optimizes the 3D representation directly and efficiently instead of updating it through incremental, iterative edits. The representation is based on 3D Gaussian Splatting, which is faster than a traditional NeRF model and also supports local editing, making selective edits easier.

As a result, DGE reduces processing time while preserving editing quality, and it can selectively edit specific parts of a scene. According to the reported experiments, DGE outperforms existing methods in both editing speed and the quality of the final results.
### Formula Summary

- **Gaussian Function**:
  \[ g_i(x)=\exp\left(-\frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i)\right) \]
  where \(\mu_i\) and \(\Sigma_i\) are the mean and covariance of the \(i\)-th Gaussian.
- **Color and Opacity Functions**:
  \[ \sigma(x)=\sum_{i = 1}^G\sigma_i g_i(x),\quad c(x,\nu)=\frac{\sum_{i = 1}^G c_i(\nu)\sigma_i g_i(x)}{\sum_{i = 1}^G\sigma_i g_i(x)} \]
  where \(G\) is the number of Gaussians, \(\sigma_i\) is the opacity of the \(i\)-th Gaussian, and \(c_i(\nu)\) is its color seen from direction \(\nu\).
- **Feature Injection in Multi-view Consistent Editing**:
  \[ M_{t'}[u]=\arg\min_{v,\; v^T F u = 0}D(\Psi_{t'}[u],\Psi_{k^*}[v]),\quad\forall t'\in T\setminus K \]
  where \(D\) is the cosine distance, \(F\) is the fundamental matrix between the two views, \(u\) and \(v\) are spatial indices into the feature maps \(\Psi_{t'}\) and \(\Psi_{k^*}\) respectively, and \(k^*\) is the index of the key view closest to view \(t'\). The constraint \(v^T F u = 0\) restricts candidate matches for \(u\) to its epipolar line in the key view.

Together, these formulas describe the scene representation and the cross-view matching that form the core mechanism of DGE.
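To make the formulas above concrete, here is a minimal NumPy sketch, not the authors' implementation. The function names (`gaussian_weights`, `opacity_and_color`, `epipolar_match`) are hypothetical. It makes three assumptions: colors are evaluated for a single fixed viewing direction, pixels are written in homogeneous form as (column, row, 1), and the exact constraint \(v^T F u = 0\) is relaxed to "within `tol` pixels of the epipolar line" so that it can be applied on a discrete feature grid.

```python
import numpy as np

def gaussian_weights(x, mus, sigmas_inv):
    """Evaluate g_i(x) = exp(-0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)) for all i.

    x: (3,) query point; mus: (G, 3) means; sigmas_inv: (G, 3, 3) inverse covariances.
    """
    d = x[None, :] - mus                               # (G, 3) offsets from each mean
    maha = np.einsum("gi,gij,gj->g", d, sigmas_inv, d)  # squared Mahalanobis distances
    return np.exp(-0.5 * maha)

def opacity_and_color(x, mus, sigmas_inv, opacities, colors, eps=1e-12):
    """Return sigma(x) and the opacity-weighted average color c(x).

    colors: (G, 3) per-Gaussian colors for one fixed view direction (assumption).
    """
    w = opacities * gaussian_weights(x, mus, sigmas_inv)  # sigma_i * g_i(x)
    sigma = w.sum()
    color = (w[:, None] * colors).sum(axis=0) / max(sigma, eps)
    return sigma, color

def epipolar_match(feat_t, feat_k, F, tol=0.5):
    """For each pixel u of view t', find the key-view pixel v near the epipolar
    line F u with the smallest cosine distance between features.

    feat_t, feat_k: (H, W, C) feature maps; F: (3, 3) with v^T F u = 0.
    Returns an (H, W, 2) integer array of (row, col) matches, -1 where none exist.
    """
    H, W, C = feat_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)], axis=1)  # (N, 3)

    def unit(f):
        f = f.reshape(-1, C)
        return f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)

    ft, fk = unit(feat_t), unit(feat_k)
    match = np.full((H * W, 2), -1, dtype=int)
    for n in range(H * W):
        line = F @ pix[n]                         # epipolar line of u in the key view
        norm = np.hypot(line[0], line[1])
        if norm < 1e-12:
            continue                              # degenerate line, no constraint
        dist = np.abs(pix @ line) / norm          # pixel distance of every v to the line
        cand = np.flatnonzero(dist <= tol)
        if cand.size == 0:
            continue
        cos_dist = 1.0 - fk[cand] @ ft[n]         # cosine distance D
        best = cand[np.argmin(cos_dist)]
        match[n] = (best // W, best % W)
    return match.reshape(H, W, 2)
```

The per-pixel loop keeps the correspondence with the formula easy to read. A practical version would likely vectorize the search or run it on the GPU, and the tolerance would need tuning to the feature-map resolution.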