Abstract: Large foundation models such as DUSt3R can produce high-quality outputs, including pointmaps, camera intrinsics, and depth estimates, from stereo image pairs. However, applying these outputs to tasks like visual localization requires substantial inference time and compute resources. To address these limitations, we propose a knowledge distillation pipeline: a student-teacher setup with DUSt3R as the teacher, in which we explore multiple student architectures trained on the 3D reconstructed points output by DUSt3R. Our goal is to build student models that learn scene-specific representations and output 3D points with performance comparable to DUSt3R's. We train our models on the 12Scenes dataset. We test two main architectures: a CNN-based architecture and a Vision Transformer-based architecture. For each architecture, we also compare pre-trained models against models trained from scratch. We qualitatively compare the 3D points reconstructed by the student model against DUSt3R's and discuss the features learned by the student model. We also perform ablation studies on the models through hyperparameter tuning. Overall, we observe that the Vision Transformer gives the best performance, both visually and quantitatively.

What problem does this paper attempt to address?

The main problem this paper addresses is how to use knowledge distillation to build a lightweight student model for multi-view 3D reconstruction that can replace a large foundation model such as DUSt3R, thereby reducing inference time and compute consumption and improving the accuracy and consistency of the 3D point cloud output. Specifically, the paper focuses on the following aspects:

1. **Computational Resources and Inference Time**: Although large foundation models such as DUSt3R can generate high-quality 3D point clouds, camera intrinsics, and depth estimates, practical applications such as visual localization require a large amount of inference time and compute. To reduce these requirements, the authors propose using knowledge distillation to train smaller, more efficient models.
2. **3D Point Output in the World Coordinate System**: The 3D point cloud output by DUSt3R is not expressed relative to a fixed world coordinate system, which limits its use in certain tasks. The authors therefore want the student model to learn scene-specific information so that its output 3D point cloud can be represented in a fixed world coordinate system.
3. **Learning of Scene-Specific Representations**: To improve the student model's performance in specific scenes, the authors want it to learn scene-specific representations rather than only generalized features. This helps in tasks such as visual localization.
4. **Selection and Optimization of Model Architectures**: The authors explore two main student architectures, one based on Convolutional Neural Networks (CNNs) and one based on the Vision Transformer. They compare the architectures and tune hyperparameters to find the model best suited to the task.
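To make the second aspect concrete: expressing a point cloud in a fixed world frame amounts to applying a rigid transform to every point. The following is a minimal sketch, assuming a known camera-to-world rotation `R` and translation `t`; the function name and these inputs are hypothetical, and the summary does not say how the paper obtains or uses camera poses.

```python
import numpy as np

def camera_to_world(points_cam, R, t):
    """Map an (N, 3) array of camera-frame points into a world frame.

    Assumes p_world = R @ p_cam + t, where R is a 3x3 rotation matrix and
    t is a 3-vector (both hypothetical inputs for this illustration).
    """
    points_cam = np.asarray(points_cam, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    # Row-vector form of R @ p + t, applied to all points at once.
    return points_cam @ R.T + t

# Example: a 90-degree rotation about the z-axis plus a shift along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
pts = np.array([[1.0, 0.0, 0.0],
                [0.0, 2.0, 5.0]])
world = camera_to_world(pts, R, t)
print(world)
```

A scene-specific student that predicts world-frame points directly would, in effect, absorb a transform of this kind into its weights.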
### Core Objectives of the Paper

- **Construct a Lightweight Student Model**: Learn from DUSt3R (the teacher model) through knowledge distillation to construct a student model that significantly reduces compute consumption while maintaining high precision.
- **Achieve 3D Point Cloud Output in the World Coordinate System**: Ensure that the 3D point cloud output by the student model is consistently expressed in a fixed world coordinate system.
- **Improve the Performance of Scene-Specific Tasks**: By learning scene-specific representations, improve the student model's performance in downstream tasks such as visual localization.

### Formula Representation

The formulas in the paper mainly cover the loss function and key parts of the model structure. For example, the Mean Squared Error (MSE) loss used when training the student model can be written as:

\[
L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \|\mathbf{p}_i^{stu} - \mathbf{p}_i^{tea}\|^2
\]

where \(\mathbf{p}_i^{stu}\) and \(\mathbf{p}_i^{tea}\) are the \(i\)-th 3D point predicted by the student model and the teacher model, respectively, and \(N\) is the total number of points.

Through these methods, the paper aims to address the limitations of existing large foundation models in multi-view 3D reconstruction and to provide a more efficient and practical solution.
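The MSE formula above can be sketched in a few lines of NumPy. This is an illustrative implementation of the formula as written in this summary, not the authors' training code; the function name `distillation_mse` and the toy point values are assumptions.

```python
import numpy as np

def distillation_mse(student_points, teacher_points):
    """Mean over N points of the squared Euclidean distance between the
    student's and the teacher's 3D predictions."""
    student = np.asarray(student_points, dtype=float)
    teacher = np.asarray(teacher_points, dtype=float)
    if student.shape != teacher.shape:
        raise ValueError("student and teacher outputs must have the same shape")
    # Flatten any leading dimensions (e.g. an H x W pointmap) into N points.
    diff = student.reshape(-1, 3) - teacher.reshape(-1, 3)
    # Squared norm per point, then the average over all N points.
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Toy example with N = 2 points (made-up values).
teacher = np.array([[0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0]])
student = np.array([[0.0, 0.0, 1.0],
                    [1.0, 3.0, 1.0]])
loss = distillation_mse(student, teacher)
print(loss)  # squared distances are 1 and 4, so the mean is 2.5
```

Note that this sums over the three coordinates of each point before averaging, as in the formula. An element-wise MSE that averages over all coordinates, which is the default reduction in many deep learning frameworks, would be smaller by a factor of 3.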

Multi-View 3D Reconstruction using Knowledge Distillation
