Abstract: We consider the problem of editing 3D objects and scenes based on open-ended language instructions. A common approach to this problem is to use a 2D image generator or editor to guide the 3D editing process, obviating the need for 3D data. However, this process is often inefficient due to the need for iterative updates of costly 3D representations, such as neural radiance fields, either through individual view edits or score distillation sampling. A major disadvantage of this approach is the slow convergence caused by aggregating inconsistent information across views, as the guidance from 2D models is not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two stages. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. To do so, we propose a training-free approach that integrates cues from the 3D geometry of the underlying scene. Second, given a multi-view consistent edited sequence of images, we directly and efficiently optimize the 3D representation, which is based on 3D Gaussian Splatting. Because it avoids incremental and iterative edits, DGE is significantly more accurate and efficient than existing approaches and offers additional benefits, such as enabling selective editing of parts of the scene.

What problem does this paper attempt to address?

The paper addresses the inefficiency and inconsistency of editing 3D objects and scenes from open-ended language instructions. Existing methods usually rely on a 2D image generator or editor to guide the 3D editing process. Although this avoids the need for 3D data, it is inefficient because costly 3D representations (such as Neural Radiance Fields) must be updated iteratively, and it converges slowly because the guidance from the 2D model is not consistent across views.

To solve these problems, the authors introduce the Direct Gaussian Editor (DGE), which works in two stages:

1. **Multi-view Consistent Editing**: A high-quality 2D image editor (such as InstructPix2Pix) is modified, without retraining, so that it produces multi-view consistent edits. The key to this step is incorporating 3D geometric information from the underlying scene into the editing process, so that edits seen from different viewpoints agree with one another.
2. **Direct Optimization of 3D Representations**: Given the sequence of multi-view consistent edited images, DGE optimizes the 3D representation directly and efficiently instead of updating it through incremental, iterative edits. The representation is based on 3D Gaussian Splatting, which is faster than a traditional NeRF model and also supports local editing, improving the selectivity and flexibility of edits.

With this design, DGE significantly reduces processing time while preserving editing quality, and it supports selective editing of specific parts of the scene. The experimental results show that, compared with existing methods, DGE has clear advantages in both editing speed and the quality of the final results.

### Formula Summary

- **Gaussian Function**:
  \[ g_i(x)=\exp\left(-\frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i)\right) \]
- **Color and Opacity Functions**:
  \[ \sigma(x)=\sum_{i = 1}^G\sigma_i g_i(x),\quad c(x,\nu)=\frac{\sum_{i = 1}^G c_i(\nu)\sigma_i g_i(x)}{\sum_{i = 1}^G\sigma_i g_i(x)} \]
- **Feature Injection in Multi-view Consistent Editing**:
  \[ M_{t'}[u]=\arg\min_{v,\; v^T F u = 0}D(\Psi_{t'}[u],\Psi_{k^*}[v]),\quad\forall t'\in T\setminus K \]
  where \(D\) is the cosine distance, \(F\) is the fundamental matrix between the two views, \(u\) and \(v\) are spatial indices into the respective feature maps, and \(k^*\) is the index of the key view closest to view \(t'\).

These formulas and technical details together form the core mechanism of DGE, enabling it to achieve better results in multi-view consistent editing.
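
The formulas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array layouts (feature maps of shape `(H, W, C)`, pixels as homogeneous `(x, y, 1)` vectors), and the `tol` parameter are assumptions made for illustration. In particular, the exact constraint \(v^T F u = 0\) is relaxed here to "within `tol` pixels of the epipolar line", since discrete pixel indices rarely satisfy it exactly, and the colors \(c_i(\nu)\) are assumed to be already evaluated for the viewing direction \(\nu\).

```python
import numpy as np


def gaussian_weights(x, means, covs):
    """g_i(x) = exp(-0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)) for all G Gaussians.

    x: (3,), means: (G, 3), covs: (G, 3, 3). Returns an array of shape (G,).
    """
    diffs = x[None, :] - means                                  # (G, 3)
    # Solve Sigma_i y = (x - mu_i) instead of inverting each covariance.
    solved = np.linalg.solve(covs, diffs[..., None])[..., 0]    # (G, 3)
    mahalanobis = np.einsum("gi,gi->g", diffs, solved)          # (G,)
    return np.exp(-0.5 * mahalanobis)


def opacity_and_color(x, means, covs, opacities, colors, eps=1e-12):
    """sigma(x) and c(x, nu); `colors` holds c_i(nu) for one fixed view, shape (G, 3)."""
    weights = opacities * gaussian_weights(x, means, covs)      # sigma_i * g_i(x)
    sigma = weights.sum()
    # eps (an assumption) avoids division by zero far away from every Gaussian.
    color = (weights[:, None] * colors).sum(axis=0) / max(sigma, eps)
    return sigma, color


def epipolar_match(feat_u, u, feats_key, F, tol=0.5):
    """Pick the key-view pixel v near the epipolar line of u with the smallest
    cosine distance to feat_u. Returns v as (x, y).

    feat_u: (C,) feature at pixel u = (x, y) in view t'.
    feats_key: (H, W, C) feature map of the key view k*.
    F: (3, 3) fundamental matrix, so that v^T F u = 0 for true correspondences.
    """
    H, W, C = feats_key.shape
    ys, xs = np.mgrid[0:H, 0:W]
    v_h = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)], axis=1)  # (HW, 3)
    line = F @ np.array([u[0], u[1], 1.0])          # epipolar line of u in the key view
    # Point-to-line distance in pixels (relaxed form of v^T F u = 0).
    dist = np.abs(v_h @ line) / (np.hypot(line[0], line[1]) + 1e-12)
    flat = feats_key.reshape(-1, C)
    cos_sim = flat @ feat_u / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(feat_u) + 1e-8
    )
    # Pixels off the epipolar line get infinite cost and can never be selected
    # unless no pixel satisfies the constraint at all.
    cost = np.where(dist <= tol, 1.0 - cos_sim, np.inf)
    best = int(np.argmin(cost))
    return best % W, best // W
```

As a sanity check on the blending formula: for two unit-covariance Gaussians with equal opacity placed symmetrically about the query point, the blended color is the mean of their two colors. For the matching step, a fundamental matrix corresponding to a pure horizontal camera shift restricts candidates to the same image row as `u`, so an identical feature in a different row is ignored.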

DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing
