Abstract: We present Text-driven Object-Centric Style Transfer (TEXTOC), a novel method that guides style transfer at an object-centric level using textual inputs. The core of TEXTOC is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric transformations that are closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for even CLIP embedding distribution across object regions. It ensures a seamless and harmonious style transfer across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image's background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style transfers.

### What problem does this paper attempt to solve?

This paper addresses text-driven object-centric style transfer. The authors propose a new method, **Text-driven Object-Centric Style Transfer (TEXTOC)**, with the following goals:

1. **Object-level style transfer**: Traditional style transfer methods usually rely on reference images to guide the stylization. These methods are limited in complex scenes, especially when only specific objects should be stylized. TEXTOC instead uses text descriptions to guide style transfer directly.
2. **Maintaining the structural integrity of objects and backgrounds**: Many existing methods alter the content of objects or the structure of the background during style transfer, which produces visually inconsistent results. TEXTOC introduces several loss functions and modules so that the structure of objects and the original style of the background are preserved.
3. **No need for segmentation masks or reference images**: Existing object-centric style transfer methods usually rely on segmentation masks or reference images to identify and process specific objects. TEXTOC removes the need for these additional inputs and achieves accurate object localization and style transfer from text descriptions alone.

### Main contributions

To achieve these goals, the paper proposes the following key components:

- **Patch-wise Co-Directional (PCD) Loss**: Keeps the direction of style transfer consistent with the text description and keeps the feature distribution within the object region consistent.

  \[ L_{\text{pcd}} = \lambda_{\text{dir}} L_{\text{dir}} + \lambda_{\text{con}} L_{\text{con}} \]

  where:

  - \(L_{\text{dir}}\) is the patch-wise direction loss, defined as:

    \[ L_{\text{dir}} = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{\Delta P_i \cdot \Delta T}{|\Delta P_i|\,|\Delta T|}\right) \]

    with

    \[ \Delta P_i = E_I(\text{aug}(P_i^{\text{out}})) - E_I(\text{aug}(P_i^{\text{src}})), \quad \Delta T = E_T(T_{\text{tgt}}) - E_T(T_{\text{src}}) \]

  - \(L_{\text{con}}\) is the patch distribution consistency loss, defined as:

    \[ L_{\text{con}} = \text{JSD}(D_{\text{src}}, D_{\text{out}}) \]

    where JSD denotes the Jensen-Shannon divergence.

- **Adaptive Background Preservation (ABP) Loss**: Keeps the style and structure of the background region unchanged, so that the style is not mistakenly transferred to the background.

  \[ L_{\text{abp}}=L_{\text{MSSSIM}}(I_{\text{out}}\odot M_{\text{bg}}^*, I_{\text{src}}\odot M_{\text{bg}}^*)+L_{L1}(I_{\text{out}}\odot M_{\text{bg}}^*, I_{\text{src}}
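
As a rough illustration of the PCD loss defined above, the following pure-Python sketch evaluates \(L_{\text{dir}}\), \(L_{\text{con}}\), and their weighted sum on toy inputs. It is not the authors' implementation. It assumes that the patch and text embeddings are already computed and passed in as plain lists of floats (the encoders \(E_I\), \(E_T\) and the augmentation step are not modeled), and that \(D_{\text{src}}\) and \(D_{\text{out}}\) are given as discrete probability distributions, since the summary does not say how they are built. The function names, the small epsilon in the cosine, and the default weights of 1.0 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def directional_loss(src_patches, out_patches, text_src, text_tgt):
    """L_dir: mean over patches of 1 - cos(delta_P_i, delta_T)."""
    delta_t = [t - s for s, t in zip(text_src, text_tgt)]
    total = 0.0
    for p_src, p_out in zip(src_patches, out_patches):
        delta_p = [o - s for s, o in zip(p_src, p_out)]
        total += 1.0 - cosine(delta_p, delta_t)
    return total / len(src_patches)


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pcd_loss(src_patches, out_patches, text_src, text_tgt,
             d_src, d_out, lam_dir=1.0, lam_con=1.0):
    """L_pcd = lam_dir * L_dir + lam_con * L_con (weights are placeholders)."""
    l_dir = directional_loss(src_patches, out_patches, text_src, text_tgt)
    l_con = jsd(d_src, d_out)
    return lam_dir * l_dir + lam_con * l_con


if __name__ == "__main__":
    # Toy 2-D "embeddings": both patches move along the text direction.
    src = [[0.0, 0.0], [1.0, 1.0]]
    out = [[1.0, 0.0], [3.0, 1.0]]
    print(pcd_loss(src, out, [0.0, 0.0], [2.0, 0.0], [0.5, 0.5], [0.5, 0.5]))
```

In this toy case every patch shift is parallel to the text shift and the two distributions are identical, so the loss is close to zero. Reversing a patch shift pushes its directional term toward 2, and fully disjoint distributions push the JSD term to its maximum of ln 2.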

TEXTOC: Text-driven Object-Centric Style Transfer

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

TeSTNeRF: Text-Driven 3D Style Transfer Via Cross-Modal Learning

Diversified Patch-based Style Transfer with Shifted Style Normalization

ITstyler: Image-optimized Text-based Style Transfer

CLIPstyler: Image Style Transfer with a Single Text Condition

TextStyler: A CLIP-based approach to text-guided style transfer

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Name Your Style: An Arbitrary Artist-aware Image Style Transfer

SC2: Towards Enhancing Content Preservation and Style Consistency in Long Text Style Transfer

CPST: Comprehension-Preserving Style Transfer for Multi-Modal Narratives

Sem-CS: Semantic CLIPStyler for Text-Based Image Style Transfer

Contextual Text Style Transfer

MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP

Any-to-Any Style Transfer: Making Picasso and Da Vinci Collaborate

DeepObjStyle: Deep Object-based Photo Style Transfer

Foreground and background separated image style transfer with a single text condition

Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object

So Different Yet So Alike! Constrained Unsupervised Text Style Transfer

Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer