Style3D: Attention-guided Multi-view Style Transfer for 3D Object Generation

Bingjie Song, Xin Huang, Ruting Xie, Xue Wang, Qing Wang
2024-12-05
Abstract: We present Style3D, a novel approach for generating stylized 3D objects from a content image and a style image. Unlike most previous methods that require case- or style-specific training, Style3D supports instant 3D object stylization. Our key insight is that 3D object stylization can be decomposed into two interconnected processes: multi-view dual-feature alignment and sparse-view spatial reconstruction. We introduce MultiFusion Attention, an attention-guided technique to achieve multi-view stylization from the content-style pair. Specifically, the query features from the content image preserve geometric consistency across multiple views, while the key and value features from the style image are used to guide the stylistic transfer. This dual-feature alignment ensures that spatial coherence and stylistic fidelity are maintained across multi-view images. Finally, a large 3D reconstruction model is introduced to generate coherent stylized 3D objects. By establishing an interplay between structural and stylistic features across multiple views, our approach enables a holistic 3D stylization process. Extensive experiments demonstrate that Style3D offers a more flexible and scalable solution for generating style-consistent 3D assets, surpassing existing methods in both computational efficiency and visual quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper aims to solve the following problems:

1. **Challenges in stylized 3D object generation**:
   - Style-specific training: most existing methods must be trained for a particular style or case, which makes them inefficient and inflexible when faced with new styles or new objects.
   - Multi-view inconsistency: applying 2D style transfer to each view independently often introduces inconsistencies between the multi-view images, which degrades 3D reconstruction quality.
   - High time cost: these methods usually require per-style optimization, leading to long computation times.

2. **The need for instant generation of high-quality, style-consistent 3D objects**:
   - Users want to quickly generate 3D objects in any user-defined style, to meet diverse needs in fields such as video games, digital art, and virtual reality.
   - Existing methods have difficulty applying diverse styles to 3D objects, especially when style consistency and geometric consistency must both be maintained.

### Solutions

To solve the above problems, the paper proposes **Style3D**, a diffusion-based method that instantly generates high-quality, style-consistent 3D objects from content-style image pairs. Specifically, the main contributions of Style3D include:

1. **Multi-View Dual-Feature Alignment**:
   - Decomposes the 3D object stylization task into two interrelated processes: multi-view dual-feature alignment and sparse-view spatial reconstruction.
   - Introduces the MultiFusion Attention mechanism, which aligns content features and style features across multiple views to ensure geometric consistency and style fidelity.

2. **Sparse-View Spatial Reconstruction**:
   - Uses a triplane representation for high-fidelity 3D reconstruction, so that the generated 3D objects maintain geometric accuracy and style consistency across different views.

3. **Efficient 3D Stylized Generation**:
   - Style3D can generate a high-quality 3D object from a single content-style image pair within 30 seconds, without any additional training.
   - Compared with existing methods, Style3D performs well in both computational efficiency and visual quality, providing a more flexible and scalable solution.

Through these contributions, Style3D improves the efficiency and scalability of generating high-quality, style-consistent 3D objects in large numbers, and addresses the limitations of existing methods in multi-view consistency and computational efficiency.
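
The abstract describes MultiFusion Attention as taking query features from the content image and key and value features from the style image. The sketch below illustrates only that general pattern, as plain cross-attention in dependency-free Python. It is not the paper's implementation: the actual mechanism sits inside a diffusion model whose details are not given here, and every name (`content_style_attention`, `w_q`, `w_k`, `w_v`, the toy dimensions) is a hypothetical placeholder.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) nested list by a (k x m) nested list."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def content_style_attention(content_views, style_tokens, w_q, w_k, w_v):
    """Cross-attention with queries from content and keys/values from style.

    content_views: list of V views, each an (N x D) list of content features.
    style_tokens:  (M x D) list of style features, shared by every view.
    w_q, w_k, w_v: (D x d) projection matrices (assumed, for illustration).
    Returns per-view outputs (V x N x d) and attention weights (V x N x M).
    """
    # Keys and values are computed once from the style image, so every
    # view attends to the same style representation.
    k = matmul(style_tokens, w_k)
    v = matmul(style_tokens, w_v)
    k_t = [list(col) for col in zip(*k)]  # transpose to (d x M)
    scale = math.sqrt(len(k[0]))

    outputs, weights = [], []
    for view in content_views:
        q = matmul(view, w_q)              # queries come from the content view
        scores = matmul(q, k_t)            # (N x M) content-to-style similarity
        attn = [softmax([s / scale for s in row]) for row in scores]
        outputs.append(matmul(attn, v))    # style values mixed per content token
        weights.append(attn)
    return outputs, weights


def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for learned features or weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
V, N, M, D, d = 4, 6, 5, 8, 3  # toy sizes: views, content tokens, style tokens, dims
content_views = [rand_matrix(N, D) for _ in range(V)]
style_tokens = rand_matrix(M, D)
w_q, w_k, w_v = rand_matrix(D, d), rand_matrix(D, d), rand_matrix(D, d)

outputs, weights = content_style_attention(content_views, style_tokens, w_q, w_k, w_v)
```

Because the keys and values are shared across views while each view supplies its own queries, the spatial layout of each output follows the content view and the mixed-in features come from one common style source. This is one plausible reading of how the dual-feature alignment could keep style consistent across views.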