T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, Yong-Jin Liu

2024-04-17

Abstract: Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among an extensive 10 prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the lack of a systematic evaluation benchmark for current text-to-3D generation methods. It identifies two main gaps in how these methods are evaluated:

1. **Lack of diverse and challenging test prompts**: Most studies use overly simple text prompts that do not reflect the actual performance of text-to-3D methods.
2. **Lack of automatic and comprehensive evaluation metrics**: Existing evaluations mostly rely on subjective user studies, or judge 3D generation from single-view 2D images, which cannot capture the overall quality of a 3D scene.

To close these gaps, the paper introduces T$^3$Bench, the first comprehensive text-to-3D benchmark. T$^3$Bench includes three sets of text prompts of increasing complexity and proposes two automatic metrics based on multi-view images, one for the subjective quality of the generated 3D scene and one for its alignment with the text. These metrics can detect quality problems and view inconsistency in generated 3D scenes, providing a paradigm for efficiently evaluating text-to-3D models.
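The abstract says the quality metric combines multi-view text-image scores with regional convolution, but this page does not give the details. The following is a minimal sketch of that general idea under stated assumptions, not the authors' implementation: views are assumed to lie on a single ring around the object, the "regional convolution" is assumed to be a circular mean filter, and the final score is assumed to be the maximum of the smoothed scores. The function names `regional_smooth` and `quality_score` are hypothetical, and the per-view scores are placeholder numbers standing in for the output of a text-image scoring model.

```python
from typing import List, Sequence


def regional_smooth(scores: Sequence[float], radius: int = 1) -> List[float]:
    """Average each view's score with its neighbours on a circular ring of views.

    Assumption: views are ordered by azimuth, so index i-1 and i+1 are
    the neighbouring viewpoints of view i (wrapping around at the ends).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one view score")
    smoothed = []
    for i in range(n):
        window = [scores[(i + k) % n] for k in range(-radius, radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def quality_score(scores: Sequence[float], radius: int = 1) -> float:
    """Smooth per-view scores over neighbouring views, then keep the best region."""
    return max(regional_smooth(scores, radius))


# Placeholder per-view scores (hypothetical values, not benchmark results).
consistent = [0.6] * 8          # every view looks reasonable
one_lucky_view = [0.0] * 7 + [0.9]  # a single good view, the rest are poor

print(quality_score(consistent))
print(quality_score(one_lucky_view))
```

With a plain maximum over views, the second object would outscore the first because of its one good view. Smoothing over neighbouring views first means a high score only counts when the surrounding views agree, which is one way a multi-view metric can penalize view inconsistency.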
