Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object

Junhao Chen, Peng Rong, Jingbo Sun, Chao Li, Xiang Li, Hongwu Lv
2023-11-29
Abstract: Image style transfer occupies an important place in both computer graphics and computer vision. However, most current methods require reference to stylized images and cannot individually stylize specific objects. To overcome this limitation, we propose the "Soulstyler" framework, which allows users to guide the stylization of specific objects in an image through simple textual descriptions. We introduce a large language model to parse the text and identify stylization goals and specific styles. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that ensures that style transfer is performed only on specified target objects, while non-target regions remain in their original style. Experimental results demonstrate that our model is able to accurately perform style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper proposes Soulstyler, a framework for text-guided image style transfer that targets individual objects in an image. Most existing style transfer methods require a reference style image and cannot stylize a specific object independently of the rest of the scene. Soulstyler addresses this limitation by letting users describe, in simple text, which object to stylize and in what style.

The main features of Soulstyler are:

1. **Combining a large language model with a visual encoder**: Soulstyler uses a large language model (such as GPT-4 or LLAMA-2) to parse the text input and identify both the stylization target and the desired style. A CLIP-based semantic visual embedding encoder then matches the text against the image content.
2. **Localized text-image patch matching loss**: This loss restricts style transfer to the specified target objects, so non-target areas retain their original style.
3. **Experimental results**: The paper reports that Soulstyler accurately performs style transfer on target objects according to the text description without affecting the style of background areas.

In summary, Soulstyler gives finer control over which objects in an image are stylized, which is useful for digital art creation, personalized content generation, and similar applications.
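The idea of restricting a patch-level text-image loss to the target object can be illustrated with a small sketch. This is not the paper's formulation, which is not given in this summary. It assumes a directional loss of the kind used in CLIP-guided stylization (one minus the cosine similarity between a patch's embedding shift and a text-embedding shift) and uses random vectors in place of real CLIP embeddings. The function name `masked_patch_loss`, the `coverage` input, and the `threshold` parameter are hypothetical.

```python
import numpy as np


def masked_patch_loss(stylized_emb, source_emb, text_dir, coverage,
                      threshold=0.5, eps=1e-8):
    """Directional patch loss averaged over target patches only.

    stylized_emb: (N, D) embeddings of N patches from the stylized image
    source_emb:   (N, D) embeddings of the same patches in the source image
    text_dir:     (D,) style-text embedding minus source-text embedding
    coverage:     (N,) fraction of each patch inside the target-object mask
    threshold:    minimum coverage for a patch to count as "target"
                  (hypothetical parameter, chosen for illustration)
    """
    # How each patch moved in embedding space after stylization.
    delta = stylized_emb - source_emb
    # Cosine similarity between each patch's shift and the text direction.
    denom = np.linalg.norm(delta, axis=1) * np.linalg.norm(text_dir) + eps
    cos = (delta @ text_dir) / denom
    per_patch = 1.0 - cos
    # Keep only patches that mostly overlap the target object.
    keep = coverage >= threshold
    if not keep.any():
        return 0.0
    return float(per_patch[keep].mean())


# Toy data standing in for CLIP embeddings (assumption: 4 patches, 8 dims).
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
text_dir = rng.normal(size=8)

stylized = src.copy()
stylized[:2] += 0.3 * text_dir  # target patches follow the style direction
stylized[2:] -= 0.3 * text_dir  # background patches move the opposite way

coverage = np.array([1.0, 0.8, 0.1, 0.0])  # first two patches are on target
loss = masked_patch_loss(stylized, src, text_dir, coverage)
```

In this toy setup the two background patches move against the style direction, yet they contribute nothing to the loss because their mask coverage falls below the threshold. Only the target patches drive the stylization term, which is the behavior the summary attributes to the localized loss.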