Toward Multi-Modal Conditioned Fashion Image Translation.

Xiaoling Gu, Jun Yu, Yongkang Wong, Mohan S. Kankanhalli
DOI: https://doi.org/10.1109/tmm.2020.3009500
IF: 7.3
2020-01-01
IEEE Transactions on Multimedia
Abstract: Having the capability to synthesize photo-realistic fashion product images conditioned on multiple attributes or modalities would bring many exciting new applications. In this work, we propose an end-to-end network architecture, built upon a new generative adversarial network, for automatically synthesizing photo-realistic images of fashion products under multiple conditions. Given an input pose image consisting of a 2D skeleton pose, together with a sentence description of the products, our model synthesizes a fashion image that preserves the pose and shows the described fashion products being worn. Specifically, the generator $G$ tries to generate realistic-looking fashion images based on a $\langle \mathsf {pose}, \mathsf {text} \rangle$ pair condition to fool the discriminator. An attention network is added to enhance the generator; it predicts a probability map indicating which parts of the image need to be attended to for translation. In contrast, the discriminator $D$ distinguishes real images from the translated ones based on the input pose image and text description. The discriminator is divided into two multi-scale sub-discriminators to improve its ability to distinguish real images from translated ones. Quantitative and qualitative analyses demonstrate that our method is capable of synthesizing realistic images that retain the poses of the given images while matching the semantics of the provided sentence descriptions.
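The abstract says the attention network predicts a probability map marking which parts of the image should be translated, but it does not say how that map is applied. One common way such a map is used in attention-guided image translation is a per-pixel convex blend between the input and the generator output. The sketch below illustrates that idea only; the blend rule, the function name `blend_with_attention`, and the array shapes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def blend_with_attention(source, translated, attention):
    """Hypothetical helper: per-pixel convex blend of two images.

    source, translated: float arrays of shape (H, W, C).
    attention: float array of shape (H, W), values expected in [0, 1].
    Assumed rule (not confirmed by the paper):
        output = attention * translated + (1 - attention) * source
    """
    if source.shape != translated.shape:
        raise ValueError("source and translated must share a shape")
    if attention.shape != source.shape[:2]:
        raise ValueError("attention must match the image height and width")
    # Clip to [0, 1] and add a channel axis so the mask broadcasts over C.
    mask = np.clip(attention, 0.0, 1.0)[..., np.newaxis]
    return mask * translated + (1.0 - mask) * source


# Toy usage: a 2x2 "pose image" of zeros and a "translated" image of ones.
source = np.zeros((2, 2, 3))
translated = np.ones((2, 2, 3))
attention = np.array([[0.0, 1.0], [0.25, 0.5]])
blended = blend_with_attention(source, translated, attention)
```

Under this assumed rule, pixels with attention near 0 keep the input content and pixels with attention near 1 take the generator's output, so the generator only has to produce plausible content in the regions flagged for translation.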