MaterialPicker: Multi-Modal Material Generation with Diffusion Transformers

Xiaohe Ma, Valentin Deschaintre, Miloš Hašan, Fujun Luan, Kun Zhou, Hongzhi Wu, Yiwei Hu
2024-12-04
Abstract: High-quality material generation is key for virtual environment authoring and inverse rendering. We propose MaterialPicker, a multi-modal material generator leveraging a Diffusion Transformer (DiT) architecture, improving and simplifying the creation of high-quality materials from text prompts and/or photographs. Our method can generate a material based on an image crop of a material sample, even if the captured surface is distorted, viewed at an angle or partially occluded, as is often the case in photographs of natural scenes. We further allow the user to specify a text prompt to provide additional guidance for the generation. We finetune a pre-trained DiT-based video generator into a material generator, where each material map is treated as a frame in a video sequence. We evaluate our approach both quantitatively and qualitatively and show that it enables more diverse material generation and better distortion correction than previous work.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets several challenges in high-quality material generation, particularly for virtual environment creation and inverse rendering. The authors propose **MaterialPicker**, a multi-modal material generation model based on the Diffusion Transformer (DiT) architecture. The main problems the paper addresses are:

1. **Generating high-quality materials from photos or text prompts**
   - Traditional material acquisition usually requires dozens or even hundreds of photos taken under known lighting, with strict constraints on camera position. This makes it difficult to create materials from photos taken in the field.
   - Existing material generation methods either rely on specific capture conditions or are trained only on synthetic datasets, which limits the diversity and generalization of what they generate.
2. **Handling distortion, occlusion, and perspective changes in images**
   - Photos of natural scenes often show surfaces that are distorted, partially occluded, or viewed from a non-orthogonal perspective. Existing methods struggle with these cases.
   - MaterialPicker can extract material properties from a photo even when the surface is viewed at an angle, deformed, or partially occluded.
3. **Supporting multi-modal input**
   - Users can generate materials from an image crop, a text description, or both. A text prompt provides additional guidance for the generation process and helps the model capture material characteristics more accurately.
4. **Improving texture rectification and material parameter generation**
   - Existing methods are limited in how well they handle textures in natural images, especially under the common occlusion and deformation problems described above. MaterialPicker both corrects these distortions and generates multiple material parameters (albedo, normal, height, roughness, and metallicity) simultaneously.
5. **Improving generation speed and quality**
   - Compared with existing material generation methods, MaterialPicker generates faster (12 seconds versus 3 minutes) and produces higher-quality materials, especially in appearance matching and in reconstructing structured textures.

### Summary

With MaterialPicker, the authors address multiple challenges in high-quality material generation: extracting materials from photos taken in the field, handling complex image distortions and occlusions, supporting multi-modal input, and improving generation speed and quality. The model starts from a pre-trained DiT-based video generator and is fine-tuned by treating each material map as a frame in a video sequence.
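
The idea of treating each material map as a video frame can be illustrated with a small data-layout sketch. This is not the authors' code: the frame order in `MAP_ORDER`, the helper names `maps_to_frames` and `frames_to_maps`, and the choice to replicate single-channel maps (height, roughness, metallicity) across three channels are all assumptions made for illustration. The sketch only shows how a set of material maps could be packed into, and recovered from, a frame sequence of the kind a video model consumes.

```python
from typing import Dict, List, Sequence, Tuple, Union

# A texel is either a scalar (height, roughness, metallic) or a
# 3-component value (albedo RGB, normal XYZ).
Texel = Union[float, Sequence[float]]
Grid = List[List[Texel]]
RGB = Tuple[float, float, float]
Frame = List[List[RGB]]

# Hypothetical frame order; the ordering used in the paper is not stated here.
MAP_ORDER = ("albedo", "normal", "height", "roughness", "metallic")


def to_rgb(texel: Texel) -> RGB:
    """Return a 3-channel texel, replicating scalars across channels."""
    if isinstance(texel, (int, float)):
        value = float(texel)
        return (value, value, value)
    r, g, b = texel
    return (float(r), float(g), float(b))


def maps_to_frames(maps: Dict[str, Grid]) -> List[Frame]:
    """Pack material maps into frames indexed as [frame][row][col]."""
    missing = [name for name in MAP_ORDER if name not in maps]
    if missing:
        raise KeyError(f"missing material maps: {missing}")
    rows = len(maps[MAP_ORDER[0]])
    cols = len(maps[MAP_ORDER[0]][0])
    frames: List[Frame] = []
    for name in MAP_ORDER:
        grid = maps[name]
        # All frames of a video share one resolution, so every map must match.
        if len(grid) != rows or any(len(row) != cols for row in grid):
            raise ValueError(f"map '{name}' is not {rows}x{cols}")
        frames.append([[to_rgb(t) for t in row] for row in grid])
    return frames


def frames_to_maps(frames: List[Frame]) -> Dict[str, Frame]:
    """Inverse of maps_to_frames: attach a map name to each frame."""
    if len(frames) != len(MAP_ORDER):
        raise ValueError("unexpected number of frames")
    return dict(zip(MAP_ORDER, frames))


def constant_map(value: Texel, rows: int = 2, cols: int = 2) -> Grid:
    """Build a tiny constant-valued map for demonstration purposes."""
    return [[value for _ in range(cols)] for _ in range(rows)]


example_maps = {
    "albedo": constant_map((0.8, 0.2, 0.1)),
    "normal": constant_map((0.5, 0.5, 1.0)),
    "height": constant_map(0.0),
    "roughness": constant_map(0.6),
    "metallic": constant_map(0.0),
}
example_frames = maps_to_frames(example_maps)
```

Packing the maps this way gives every frame the same resolution and channel count, which is the shape a video generator expects; `frames_to_maps` simply reverses the packing after generation.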