T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, Yong-Jin Liu
2024-04-17
Abstract: Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among an extensive 10 prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the lack of a systematic evaluation benchmark for text-to-3D generation methods. It identifies two main gaps in how these methods are currently evaluated:

1. **Lack of diverse and challenging test prompts**: Most studies use overly simple text prompts that do not reflect how text-to-3D methods actually perform.
2. **Lack of automatic and comprehensive evaluation metrics**: Existing evaluations mostly rely on subjective user studies, or judge 3D generation from single-view 2D images, which cannot capture the overall quality of a 3D scene.

To address these gaps, the paper introduces T$^3$Bench, which it describes as the first comprehensive text-to-3D benchmark. T$^3$Bench contains three sets of text prompts of increasing complexity and proposes two automatic evaluation metrics based on multi-view images: one for the subjective quality of the generated 3D scene and one for its alignment with the text prompt. According to the paper, the quality metric detects both low quality and view inconsistency, and both metrics correlate closely with human judgments, offering a paradigm for efficiently evaluating text-to-3D models.
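
The abstract describes the quality metric only at a high level: per-view text-image scores are combined with a "regional convolution" so that a single flattering viewpoint cannot hide view inconsistency. The following is a minimal sketch of that idea under stated assumptions, not the paper's implementation. It assumes the per-view scores are already computed, that views sit on a simple ring (each view's region is itself plus its two adjacent views), and that the final score is the best regional average. The function names, the ring layout, the max aggregation, and the example scores are all hypothetical.

```python
from typing import Dict, List


def ring_neighbors(num_views: int) -> Dict[int, List[int]]:
    """Hypothetical view layout: views on a ring, each adjacent to two others."""
    return {i: [(i - 1) % num_views, (i + 1) % num_views] for i in range(num_views)}


def regional_scores(view_scores: List[float],
                    neighbors: Dict[int, List[int]]) -> List[float]:
    """Average each view's score with the scores of its neighboring views."""
    smoothed = []
    for i, score in enumerate(view_scores):
        region = [score] + [view_scores[j] for j in neighbors[i]]
        smoothed.append(sum(region) / len(region))
    return smoothed


def quality_score(view_scores: List[float],
                  neighbors: Dict[int, List[int]]) -> float:
    """Assumed aggregation: take the best regional average over all views."""
    return max(regional_scores(view_scores, neighbors))


# Made-up text-image scores for 8 views. View 0 is an isolated high score;
# views 4 to 6 form a consistently good region.
scores = [0.9, 0.1, 0.1, 0.1, 0.5, 0.6, 0.5, 0.1]
print(quality_score(scores, ring_neighbors(len(scores))))  # roughly 0.53
```

In this toy example the best single view scores 0.9, but the regional score is about 0.53 and comes from the neighborhood of view 5. An object that looks good from only one angle is therefore penalized relative to one that looks good across adjacent views.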