A review of multimodal learning for text to images

Wei Chen, Yuqing Yang, Zijian Tian, Qiteng Chen, Jueting Liu
DOI: https://doi.org/10.1007/s11042-024-19117-8
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: Information exists in various forms in the real world, and the effective interaction and fusion of multimodal information plays a key role in computer vision and deep learning research. Generating an image that matches a given text description is a multimodal task that requires both a strong generative model and cross-modal understanding. This paper provides a comprehensive analysis of recent advances in text-to-image generation and a taxonomy based on model architecture and characteristics. We classify text-to-image methods by their underlying framework: generative adversarial networks, transformers, and diffusion models. For each class of methods, we describe the network structure, advantages, and disadvantages, present the benchmark datasets and corresponding evaluation metrics, and summarize application progress and experimental results by category. Finally, we offer insights into current research challenges and possible future research directions and applications.