Abstract: The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal. Current approaches are based on a stacked modular framework that suffers from three vital issues: (1) training separate networks is time-consuming and inefficient, and the convergence of the final generative model depends on the previous generators; (2) the quality of precursor images is ignored; (3) multiple discriminator networks need to be trained. We propose an efficient and effective single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples on the basis of spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module (VSFM), with a pixel-attention module (PAM), a speech-modulation module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding from a speech encoder into the generator while improving the quality of synthesized pictures. The PAM models the semantic affinities between pixel regions by assigning larger weights to significant locations. The VSFM adopts the SMM to modulate visual feature maps using fine-grained linguistic cues present in the speech vector. Subsequently, the WFM captures the semantic importance of the image-attention mask and the speech-modulation module at the level of the channels, in an adaptive manner. Fusion-S2iGan spreads the bimodal information over all layers of the generator network to reinforce the visual feature maps at various hierarchical levels in the architecture. A series of experiments is conducted on four benchmark data sets: CUB birds, Oxford-102, Flickr8k and Places-subset. Results demonstrate the superiority of Fusion-S2iGan over state-of-the-art models with a multi-stage architecture, and a performance level that is close to traditional text-to-image approaches.
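The abstract names the three parts of the VSFM but gives no equations, so the following is only a minimal NumPy sketch of how such a block could be wired, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the sigmoid pixel mask in `pixel_attention`, the affine scale-and-shift in `speech_modulation`, the pooled channel gate in `weighted_fusion`, and all parameter names and shapes.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pixel_attention(h, w_pix):
    """PAM sketch (assumed form): one weight in (0, 1) per spatial location.

    h: feature map of shape (C, H, W); w_pix: hypothetical weights of shape (C,).
    """
    scores = np.tensordot(w_pix, h, axes=([0], [0]))  # (H, W)
    mask = sigmoid(scores)
    return h * mask[None, :, :]


def speech_modulation(h, s, w_gamma, w_beta):
    """SMM sketch (assumed form): per-channel scale and shift from the speech vector.

    s: speech embedding of shape (D,); w_gamma, w_beta: shape (C, D).
    """
    gamma = w_gamma @ s  # (C,)
    beta = w_beta @ s    # (C,)
    return gamma[:, None, None] * h + beta[:, None, None]


def weighted_fusion(a, b, w_gate):
    """WFM sketch (assumed form): channel-wise convex mix of the two branches.

    a, b: branch outputs of shape (C, H, W); w_gate: shape (C, 2C).
    """
    pooled = np.concatenate([a.mean(axis=(1, 2)), b.mean(axis=(1, 2))])  # (2C,)
    alpha = sigmoid(w_gate @ pooled)  # (C,), one gate per channel
    return alpha[:, None, None] * a + (1.0 - alpha)[:, None, None] * b


def init_params(c, d, rng):
    """Random illustrative parameters; a real model would learn these."""
    return {
        "w_pix": rng.standard_normal(c) * 0.1,
        "w_gamma": rng.standard_normal((c, d)) * 0.1,
        "w_beta": rng.standard_normal((c, d)) * 0.1,
        "w_gate": rng.standard_normal((c, 2 * c)) * 0.1,
    }


def vsfm(h, s, params):
    """Combine the attention branch and the speech branch into one feature map."""
    attended = pixel_attention(h, params["w_pix"])
    modulated = speech_modulation(h, s, params["w_gamma"], params["w_beta"])
    return weighted_fusion(attended, modulated, params["w_gate"])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.standard_normal((4, 8, 8))   # toy visual feature map
    s = rng.standard_normal(16)          # toy speech embedding
    out = vsfm(h, s, init_params(4, 16, rng))
    print(out.shape)
```

In a single-stage generator of the kind described, a block like `vsfm` would presumably be applied at each resolution level, which is one way to read "spreads the bimodal information over all layers".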

Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for Speech-to-Image Generation