Abstract: Real-world design tasks - such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer - are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inputs, making it challenging to develop models that can effectively address such varied tasks. While existing visual generative models can produce high-quality images based on prompts, they face significant limitations in professional design scenarios that involve varied forms and multiple inputs and outputs, even when enhanced with adapters like ControlNets and LoRAs. To address this, we introduce IDEA-Bench, a comprehensive benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based, and identity-preserving generation, with 275 test cases to thoroughly evaluate a model's general-purpose generation capabilities. Notably, even the best-performing model achieves only 22.48 on IDEA-Bench, while the best general-purpose model achieves only 6.81. We provide a detailed analysis of these results, highlighting the inherent challenges and providing actionable directions for improvement. Additionally, we provide a subset of 18 representative tasks equipped with multimodal large language model (MLLM)-based auto-evaluation techniques to facilitate rapid model development and comparison. We release the benchmark data, evaluation toolkits, and an online leaderboard at https://github.com/ali-vilab/IDEA-Bench, aiming to drive the advancement of generative models toward more versatile and applicable intelligent design systems.

Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

IDEA-Bench: How Far are Generative Models from Professional Designing?

Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE

LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Challenges and Opportunities in 3D Content Generation

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation

GraphicsDreamer: Image to 3D Generation with Physical Consistency

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation

Instant3D: Instant Text-to-3D Generation

AI-Generated Content (AIGC) for Various Data Modalities: A Survey

Interactive3D: Create What You Want by Interactive 3D Generation

Any-to-3D Generation via Hybrid Diffusion Supervision

LDM3D: Latent Diffusion Model for 3D

3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing