Abstract: High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.

What problem does this paper attempt to address?

This paper addresses several challenges in high-quality material generation for virtual environment authoring and inverse rendering. The authors propose **MaterialPicker**, a multi-modal material generation model based on the Diffusion Transformer (DiT) architecture. The main problems it targets are:

1. **Generating high-quality materials from photos or text prompts**:
   - Traditional material acquisition usually requires dozens or even hundreds of photos taken under known lighting conditions, with strict constraints on camera position. This makes it difficult to create materials from photos taken in the field.
   - Existing material generation methods either rely on specific capture conditions or are trained only on synthetic datasets, which limits the diversity and generalization of their output.
2. **Handling distortion, occlusion and perspective changes in images**:
   - Photos of natural scenes often show surfaces that are distorted, partially occluded or seen from an oblique viewing angle. Existing methods struggle with these cases.
   - MaterialPicker can extract material properties from photos taken at an angle, even if the surface is deformed or partially occluded.
3. **Supporting multi-modal input**:
   - Users can generate materials by providing an image crop, a text description, or both. The text prompt gives additional guidance to the generation process, helping the model capture material characteristics more accurately.
4. **Improving texture rectification and material parameter generation**:
   - Existing methods are limited in how well they handle textures in natural images, especially under occlusion and deformation. MaterialPicker corrects these distortions and generates multiple material maps (albedo, normal, height, roughness and metallicity) simultaneously.
5. **Improving generation speed and quality**:
   - Compared with existing material generation methods, MaterialPicker is faster (12 seconds vs. 3 minutes) and produces higher-quality materials, especially in appearance matching and in reconstructing structured textures.

### Summary

MaterialPicker addresses multiple challenges in high-quality material generation: extracting materials from photos taken in the field, handling image distortion and occlusion, supporting multi-modal input, and improving generation speed and quality. The model starts from a pre-trained DiT-based video generator and is fine-tuned by treating each material map as a frame in a video sequence.
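The idea of treating each material map as a frame in a video sequence can be illustrated with a small data-layout sketch. This is not the authors' implementation: the frame order in `MAP_ORDER`, the key name `metallic`, the helper names `maps_to_frames` and `frames_to_maps`, and the choice to replicate single-channel maps to three channels are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical frame order; the source lists these maps but not their ordering.
MAP_ORDER = ("albedo", "normal", "height", "roughness", "metallic")
# Maps assumed to carry three channels; the rest are treated as scalar maps.
THREE_CHANNEL = ("albedo", "normal")


def maps_to_frames(maps):
    """Stack material maps into a (T, H, W, 3) array, one map per 'frame'."""
    frames = []
    for name in MAP_ORDER:
        m = np.asarray(maps[name], dtype=np.float32)
        if m.ndim == 2:
            # Assumption: replicate scalar maps so every frame has 3 channels.
            m = np.repeat(m[..., None], 3, axis=-1)
        frames.append(m)
    return np.stack(frames, axis=0)


def frames_to_maps(frames):
    """Inverse of maps_to_frames; scalar maps are read back from channel 0."""
    out = {}
    for i, name in enumerate(MAP_ORDER):
        frame = frames[i]
        out[name] = frame if name in THREE_CHANNEL else frame[..., 0]
    return out


if __name__ == "__main__":
    h, w = 8, 8
    rng = np.random.default_rng(0)
    example = {
        "albedo": rng.random((h, w, 3), dtype=np.float32),
        "normal": rng.random((h, w, 3), dtype=np.float32),
        "height": rng.random((h, w), dtype=np.float32),
        "roughness": rng.random((h, w), dtype=np.float32),
        "metallic": rng.random((h, w), dtype=np.float32),
    }
    video = maps_to_frames(example)
    print(video.shape)  # (number of maps, height, width, 3)
```

With a layout like this, a video model that expects a sequence of RGB frames can consume or emit a full set of material maps in one pass, and `frames_to_maps` splits the generated frames back into named maps.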

MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers
