Product2IMG: Prompt-Free E-commerce Product Background Generation with Diffusion Model and Self-Improved LMM

Tingfeng Cao, Junsheng Kong, Xue Zhao, Wenqing Yao, Junwei Ding, Jinhui Zhu, Jiandong Zhang
DOI: https://doi.org/10.1145/3664647.3680753
2024-01-01
Abstract: In e-commerce platforms, visual content plays a pivotal role in capturing and retaining audience attention. A high-quality and aesthetically designed product background image can quickly grab consumers' attention and increase their confidence in taking actions, such as making a purchase. Recently, diffusion models have achieved profound advancements, rendering product background generation a promising avenue for exploration. However, text-guided diffusion models require meticulously crafted prompts. The diverse range of products makes it challenging to compose prompts that result in visually appealing and semantically appropriate background scenes. Current work has made great efforts to create prompts through expert-crafted rules or specialized fine-tuning of large language models, but it still relies on detailed human inputs and often falls short of generating desirable results by e-commerce standards. In this paper, we propose Product2Img, a novel prompt-free diffusion model with an automatic training data refinement strategy for product background generation. Product2Img employs Contrastive Background Alignment (CBA) for the text encoder to enhance the relevant background perception ability in the diffusion generation process, without the need for specific background prompts. Meanwhile, we develop Iterative Data Refinement with Self-improved Large Multimodal Model (IDR-LMM), a framework that iteratively enhances the data selection capability of the LMM for diffusion model training, thereby yielding continuous performance improvements. Furthermore, we establish an E-commerce Product Background Dataset (EPBD) for the research in this paper and future work. Experimental results indicate that our approach significantly outperforms current prevalent methods in terms of automatic metrics and human evaluation, yielding improved background aesthetics and relevance.