Abstract: We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant, aesthetically plausible, while also maintaining 3D shape awareness.

What problem does this paper attempt to address?

The main problem this paper addresses is how to control the shape and form of generated images more precisely in text-to-image synthesis. Specifically, existing methods face the following challenges when combining text descriptions with 3D shape information:

1. **Balancing text and visual conditions**: When the text describes a specific background or context, existing methods struggle to stay faithful to both the text and the target shape at the same time.
2. **View dependence**: Commonly used conditions such as edge maps or depth maps are tied to a single view, which prevents users from exploring the shape from different angles and discards valuable 3D shape information.
3. **Lack of flexibility**: Even if the model accurately reflects the target shape from a specific view, it is difficult for users to flexibly explore variations of that shape.

To solve these problems, the paper proposes the **ShapeWords** method. ShapeWords embeds 3D shapes into text prompts, so that the generated images faithfully reflect both the text description and the 3D geometric structure. In addition, ShapeWords lets users control the degree of shape guidance, so that diverse images with style variations can be generated while still conforming to the target shape.

### Core contributions of ShapeWords

- **Introduction of 3D shape tokens**: ShapeWords uses special tokens to embed 3D shape information into text prompts, enabling text-to-image models to generate plausible images that conform to both the 3D geometric structure and the text condition.
- **User-controllable shape guidance**: Users can adjust a parameter to control the strength of shape guidance, and thereby explore variations of the target shape in different poses and appearances.
- **Improved experimental results**: Experiments show that ShapeWords significantly outperforms ControlNet-based variants on multiple evaluation metrics, especially for compositional prompts.

Through these innovations, ShapeWords overcomes the limitations of existing methods in shape control and text consistency, providing a more powerful and flexible tool for text-to-image generation.
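The idea of a user-adjustable guidance strength can be illustrated with a toy sketch. The summary above does not say how ShapeWords combines shape information with the prompt, so everything here is an assumption made for illustration: the function name `apply_shape_guidance`, the per-token `shape_offsets`, and the simple additive blend are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def apply_shape_guidance(
    prompt_tokens: List[Vector],
    shape_offsets: List[Vector],
    strength: float,
) -> List[Vector]:
    """Blend prompt token embeddings with shape-derived offsets.

    Hypothetical illustration only: each token embedding is shifted by
    `strength` times a shape-dependent offset. strength = 0.0 leaves the
    prompt unchanged; strength = 1.0 applies the full offset.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    if len(prompt_tokens) != len(shape_offsets):
        raise ValueError("need one offset vector per prompt token")

    guided = []
    for token, offset in zip(prompt_tokens, shape_offsets):
        if len(token) != len(offset):
            raise ValueError("token and offset dimensions differ")
        # Additive blend (an assumption, not the paper's stated rule).
        guided.append([t + strength * o for t, o in zip(token, offset)])
    return guided


# Two toy 2-D "token embeddings"; only the second token receives a shape offset.
prompt = [[1.0, 0.0], [0.0, 1.0]]
offsets = [[0.0, 0.0], [0.5, -0.5]]

no_guidance = apply_shape_guidance(prompt, offsets, 0.0)
half_guidance = apply_shape_guidance(prompt, offsets, 0.5)
full_guidance = apply_shape_guidance(prompt, offsets, 1.0)
```

In this sketch, a lower strength keeps the embedding closer to the plain text prompt, while a higher strength moves it toward the shape-conditioned version, which mirrors the trade-off between text fidelity and shape fidelity described above.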

ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Chasing Consistency in Text-to-3D Generation from a Single Image.

ShapeSynth: Parameterizing Model Collections for Coupled Shape Exploration and Synthesis

ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model.

Control3D: Towards Controllable Text-to-3D Generation

ShapeWordle: Tailoring Wordles using Shape-aware Archimedean Spirals

HeadSculpt: Crafting 3D Head Avatars with Text

Text-Free Controllable 3-D Point Cloud Generation

Text Guided Person Image Synthesis

EXIM: A Hybrid Explicit-Implicit Representation for Text-Guided 3D Shape Generation

HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation

ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model

Text-to-3D Shape Generation

TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision

Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation

Artistic Text Stylization for Visual-Textual Presentation Synthesis

Text‐to‐3D Shape Generation

SynthText3D: synthesizing scene text images from 3D virtual worlds

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Shape-Guided Diffusion with Inside-Outside Attention