Abstract: The multifaceted nature of human perception and comprehension suggests that, when we think, we can naturally combine any set of senses, i.e., modalities, into a coherent picture in our mind. For example, when we see a cattery and simultaneously hear a cat purring, our brain can construct a picture of a cat in the cattery. Intuitively, generative AI models should share this versatility and be capable of generating images from any combination of modalities efficiently and collaboratively. This paper presents ImgAny, a novel end-to-end multi-modal generative model that can mimic human reasoning and generate high-quality images. Our method is the first attempt to efficiently and flexibly take any combination of seven modalities: language, audio, and five vision modalities (image, point cloud, thermal, depth, and event data). Our key idea is inspired by human-level cognitive processes and involves integrating and harmonizing multiple input modalities at both the entity and attribute levels without modality-specific tuning. Accordingly, our method introduces two novel training-free technical branches: 1) the Entity Fusion Branch ensures coherence between inputs and outputs by extracting entity features from the multi-modal representations, powered by our specially constructed entity knowledge graph; 2) the Attribute Fusion Branch preserves and processes attributes by efficiently amalgamating distinct attributes from diverse input modalities via our proposed attribute knowledge graph. Lastly, the entity and attribute features are adaptively fused as the conditional inputs to the pre-trained Stable Diffusion model for image generation. Extensive experiments under diverse modality combinations demonstrate ImgAny's exceptional capability for visual content creation.
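The two-branch idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. The abstract does not say how the knowledge graphs are queried or how the fusion is weighted, so everything here is an assumption: each knowledge graph is reduced to a plain name-to-embedding dictionary, lookup is nearest neighbor by cosine similarity, and the "adaptive" fusion is a fixed convex combination controlled by a hypothetical parameter `alpha`. The names `fuse`, `nearest`, `entity_vocab`, and `attribute_vocab` are likewise hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(query, vocab):
    """Return (name, vector) of the vocab entry most similar to query."""
    return max(vocab.items(), key=lambda kv: cosine(query, kv[1]))

def fuse(modality_embs, entity_vocab, attribute_vocab, alpha=0.5):
    """Toy stand-in for the two training-free branches (all assumptions).

    modality_embs: modality name -> embedding, assumed to already live
    in one shared space. Returns the fused conditioning vector plus the
    entity and attribute names retrieved for each modality.
    """
    # Entity branch (assumed): one nearest entity per input modality.
    entity_hits = [nearest(e, entity_vocab) for e in modality_embs.values()]
    # Attribute branch (assumed): one nearest attribute per input modality.
    attr_hits = [nearest(e, attribute_vocab) for e in modality_embs.values()]
    entity_feat = mean([v for _, v in entity_hits])
    attr_feat = mean([v for _, v in attr_hits])
    # Fixed-weight blend standing in for the paper's adaptive fusion.
    condition = [alpha * e + (1 - alpha) * a
                 for e, a in zip(entity_feat, attr_feat)]
    return condition, [n for n, _ in entity_hits], [n for n, _ in attr_hits]

# Made-up 3-dimensional embeddings, for illustration only.
entity_vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
attribute_vocab = {"indoor": [0.0, 0.0, 1.0], "furry": [0.5, 0.5, 0.0]}
inputs = {"image": [0.6, 0.0, 0.8], "audio": [0.8, 0.2, 0.0]}

condition, entities, attributes = fuse(inputs, entity_vocab, attribute_vocab)
print(entities, attributes, condition)
```

In a real pipeline the resulting vector would be passed as conditioning to a pre-trained diffusion model. That step is omitted here because the abstract does not describe the interface.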

Multi3D: 3D-Aware Multimodal Image Synthesis

Fine-grained Semantic Constraint in Image Synthesis

3D-Aware Image Synthesis Via Learning Structural and Textural Representations

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Multi-View Consistent Generative Adversarial Networks for 3D-Aware Image Synthesis

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Semantically Multi-Modal Image Synthesis

Multi-view Consistent Generative Adversarial Networks for Compositional 3D-Aware Image Synthesis

Multimodal Image Synthesis and Editing: The Generative AI Era

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

Unified Multi-Modal Image Synthesis for Missing Modality Imputation

Multi-Constraint Transferable Generative Adversarial Networks for Cross-Modal Brain Image Synthesis

Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation

Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis

Semantic RGB-D Image Synthesis

3D-SSGAN: Lifting 2D Semantics for 3D-Aware Compositional Portrait Synthesis

CMOS-GAN: Semi-Supervised Generative Adversarial Model for Cross-Modality Face Image Synthesis

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion

A survey on multimodal-guided visual content synthesis