Abstract: Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.
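As a rough illustration of what "pixel-aligned point maps" and "semantic anisotropic Gaussians" could look like as data, the sketch below turns a per-pixel point map plus per-pixel scale, rotation, and semantic feature predictions into one Gaussian per pixel. It is not the LSM implementation: the function names, array layouts, and the (w, x, y, z) quaternion order are assumptions made for this example. The covariance factorization Σ = R S Sᵀ Rᵀ is the standard one used in 3D Gaussian Splatting.

```python
import numpy as np


def quat_to_rotmat(q):
    """Convert quaternions of shape (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)  # normalise to unit length
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    R = np.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], axis=-1)
    return R.reshape(-1, 3, 3)


def pixel_aligned_gaussians(point_map, scales, quats, sem_feats):
    """Build one anisotropic Gaussian per pixel (hypothetical helper).

    point_map: (H, W, 3) 3D position predicted for each pixel
    scales:    (H, W, 3) positive per-axis standard deviations
    quats:     (H, W, 4) rotation quaternions, (w, x, y, z)
    sem_feats: (H, W, C) per-pixel semantic feature vectors
    """
    H, W, _ = point_map.shape
    means = point_map.reshape(-1, 3)
    R = quat_to_rotmat(quats.reshape(-1, 4))
    S = scales.reshape(-1, 3)
    M = R * S[:, None, :]              # equals R @ diag(s): column j scaled by s_j
    cov = M @ M.transpose(0, 2, 1)     # Sigma = R S S^T R^T, symmetric PSD
    return {
        "means": means,
        "covariances": cov,
        "semantics": sem_feats.reshape(H * W, -1),
    }
```

Under these assumptions, a 2x3 image yields six Gaussians, and an identity rotation with scales (1, 2, 3) gives a diagonal covariance with entries 1, 4, and 9.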

Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention

Two-Stage Multi-Camera Constrain Mapping Pipeline for Large-Scale 3D Reconstruction

LRM: Large Reconstruction Model for Single Image to 3D

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Real3D: Scaling Up Large Reconstruction Models with Real-World Images

DP-MVS: Detail Preserving Multi-View Surface Reconstruction of Large-Scale Scenes

GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation

MeshLRM: Large Reconstruction Model for High-Quality Mesh

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction from Single Image

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Hybrid-MVS: Robust Multi-View Reconstruction with Hybrid Optimization of Visual and Depth Cues

Multi-View Stereo Representation Revisit: Region-Aware MVSNet

GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

From 2D Images to 3D Model: Weakly Supervised Multi-View Face Reconstruction with Deep Fusion

ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

Large Spatial Model: End-to-end Unposed Images to Semantic 3D