Bridging Text and Image for Artist Style Transfer via Contrastive Learning

Zhi-Song Liu, Li-Wen Wang, Jun Xiao, Vicky Kalogeiton
2024-10-12
Abstract: Image style transfer has attracted widespread attention in the past few years. Despite its remarkable results, it requires additional style images as references, making it less flexible and less convenient. Text is the most natural way to describe a style. More importantly, text can describe implicit abstract styles, such as the styles of specific artists or art movements. In this paper, we propose Contrastive Learning for Artistic Style Transfer (CLAST), which leverages advanced image-text encoders to control arbitrary style transfer. We introduce a supervised contrastive training strategy to effectively extract style descriptions from the image-text model (i.e., CLIP), which aligns stylization with the text description. To this end, we also propose a novel and efficient adaLN-based state space model that explores style-content fusion. Finally, we achieve text-driven image style transfer. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in artistic style transfer. More importantly, it does not require online fine-tuning and can render a 512x512 image in 0.03s.
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the lack of flexibility and convenience in existing image style transfer methods. Specifically:

1. **Limitations of traditional style transfer methods**:
   - Existing style transfer methods usually require a reference style image to provide style information, which makes them less flexible and less convenient. It is often difficult or infeasible to find a style image that fully meets the requirements.
   - Text is a more natural way to describe a style because it can express implicit abstract styles, such as those of specific artists or art movements.
2. **Text-driven style transfer**:
   - Compared with style images, text descriptions of style preferences are easier to obtain and easier to adjust.
   - Perceptually pleasing artist-style transfer usually requires learning from multiple artworks, because a single reference image is not sufficient to represent a style.
3. **Deficiencies of existing text-driven style transfer methods**:
   - Most general-purpose style transfer research is limited to reference images as style indicators, so these methods lack creativity and flexibility.
   - Although some text-driven style transfer methods have shown promising results, they usually require expensive data collection and annotation, or online optimization every time the content or style changes.

### Main contributions of the paper

To solve the above problems, the paper proposes an artist-style transfer model based on contrastive learning (CLAST), with the following main contributions:

1. **Embed the task-independent CLIP image-text model**:
   - The task-independent CLIP image-text model is embedded into CLAST, so that CLAST obtains style preferences from text descriptions and image style transfer becomes more interactive.
2. **Propose the adaLN-based state space model (adaLN-SSM)**:
   - The model explores style-content fusion and efficiently models local and global feature correlations. The stylized images are statistically similar to the text descriptions while retaining the original content.
3. **Supervised contrastive training strategy**:
   - A supervised contrastive training strategy aligns corresponding art texts and images offline, which lets the model perform style transfer in real-time applications.
4. **Extensive experimental verification**:
   - Quantitative and qualitative experiments show that CLAST outperforms existing methods on text-driven style transfer. CLAST also has a very short inference time, stylizing a 512×512 image in 0.03 seconds.

### Summary

By proposing CLAST, the paper addresses the lack of flexibility and convenience in existing style transfer methods and achieves efficient text-driven artist-style transfer. According to the reported experiments, the method outperforms existing methods in stylization quality and is fast enough for practical, real-time use.
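
To make the supervised contrastive training strategy listed among the contributions more concrete, the following is a minimal NumPy sketch of a generic supervised contrastive loss between image and text embeddings. It is not the loss from the paper, whose exact formulation is not given here. The function name, the `temperature` value, and the one-hot toy embeddings are all assumptions made for illustration; in practice the embeddings would come from an image-text encoder such as CLIP.

```python
import numpy as np


def supervised_contrastive_loss(image_emb, text_emb, image_labels, text_labels,
                                temperature=0.07):
    """Generic supervised contrastive loss (illustrative, not the paper's loss).

    Each image embedding is pulled toward every text embedding that shares
    its style label and pushed away from all other text embeddings.
    """
    # L2-normalise so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Similarity logits, shape (num_images, num_texts).
    logits = img @ txt.T / temperature

    # Numerically stable log-softmax over the text axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # positives[i, j] is True when image i and text j share a style label.
    positives = image_labels[:, None] == text_labels[None, :]
    pos_count = positives.sum(axis=1)
    if np.any(pos_count == 0):
        raise ValueError("every image needs at least one text with the same label")

    # Mean log-probability of the positives per image, averaged over images.
    per_image = -(log_prob * positives).sum(axis=1) / pos_count
    return float(per_image.mean())


# Toy data (hypothetical): three styles, one one-hot "text embedding" each.
text_emb = np.eye(3, 8)
text_labels = np.array([0, 1, 2])
image_labels = np.array([0, 0, 1, 2])

# Image embeddings that coincide with the text embedding of their own style.
aligned_images = text_emb[image_labels]
# Image embeddings that coincide with the text embedding of a different style.
misaligned_images = text_emb[(image_labels + 1) % 3]

loss_aligned = supervised_contrastive_loss(
    aligned_images, text_emb, image_labels, text_labels)
loss_misaligned = supervised_contrastive_loss(
    misaligned_images, text_emb, image_labels, text_labels)
print(loss_aligned, loss_misaligned)
```

On the toy data, the loss is close to zero when each image embedding matches the text embedding of its own style and much larger when it matches a different style. This is the behaviour an offline alignment stage relies on: once texts and images of the same artist or movement are aligned, a text prompt alone can stand in for a set of reference artworks at inference time.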