LRM: Large Reconstruction Model for Single Image to 3D

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
2024-03-09
Abstract: We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs, including real-world in-the-wild captures and images created by generative models. Video demos and interactable 3D meshes can be found on our LRM project webpage: https://yiconghong.me/LRM.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper proposes the Large Reconstruction Model (LRM), a method for quickly reconstructing a high-quality 3D model from a single input image. LRM has the following features:

1. **Efficiency**: LRM completes a 3D reconstruction within 5 seconds.
2. **Large-scale Model Architecture**: It adopts a Transformer-based encoder-decoder architecture with 500 million learnable parameters.
3. **Large-scale Training Dataset**: It is trained end-to-end on a multi-view dataset of approximately 1 million objects, including both synthetic renderings and real captures.
4. **High Generality**: It handles a variety of test inputs, including real-world in-the-wild captures and images created by generative models.

With this design, LRM addresses several limitations of earlier methods:

- Early learning-based methods typically performed well only on specific categories, because they relied on category-specific data priors to infer the overall shape.
- Recent methods rely on complex parameter tuning and regularization, and are limited by the pre-trained 2D generative models they build on.
- Some methods optimize the 3D geometry separately for each object, which is often slow and impractical.

By combining a large-scale model architecture with large-scale training data, LRM achieves efficient and highly general 3D reconstruction.
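
The abstract states that LRM predicts a neural radiance field (NeRF) from the input image, but this summary does not describe how such a field becomes pixels. As background only, the following is a minimal sketch of the standard NeRF volume-rendering quadrature for a single camera ray. It is not the LRM implementation; the function name `composite_ray` and its argument layout are assumptions made for illustration.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (standard NeRF quadrature).

    densities: non-negative volume densities sigma_i, ordered near to far
    colors:    (r, g, b) tuples c_i, one per sample
    deltas:    distances delta_i between adjacent samples
    Returns the rendered (r, g, b) tuple and the accumulated opacity.
    """
    transmittance = 1.0  # fraction of light still unblocked at this sample
    pixel = [0.0, 0.0, 0.0]
    opacity = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity contributed by this ray segment.
        alpha = 1.0 - math.exp(-sigma * delta)
        # Contribution weight: light that got here, times what is absorbed here.
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        opacity += weight
        # Light remaining for samples farther along the ray.
        transmittance *= 1.0 - alpha
    return tuple(pixel), opacity
```

In a NeRF-style pipeline, the densities and colors would come from querying the predicted field at sample points along each ray. Rendering images this way from known camera poses is what allows a model to be supervised with multi-view data.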