SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation

Zhenbei Wu, Qiang Wang, Jie Yang
2024-05-29
Abstract: The scarcity of free-hand sketches presents a challenging problem. Despite the emergence of some large-scale sketch datasets, these datasets primarily consist of sketches at the single-object level. There continues to be a lack of large-scale paired datasets for scene sketches. In this paper, we propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketch, enabling the transformation of single-object sketches into scene sketches. To accomplish this, we introduce a method for vector sketch captioning and sketch semantic expansion. Additionally, we design a sketch generation network that incorporates a fusion of multi-modal perceptual constraints and is suitable for zero-shot image-to-sketch downstream tasks, demonstrating state-of-the-art performance through experimental validation. Finally, leveraging our proposed sketch-to-sketch generation method, we contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent "text-sketch-image" triplets. Our research confirms that this dataset can significantly enhance the capabilities of existing models in sketch-based image retrieval and sketch-controlled image synthesis tasks. We will make our dataset and code publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the scarcity of free-hand sketch data and proposes a way to generate scene-level sketches. Specifically:

- **Core Issue**: Several large-scale single-object sketch datasets exist, but large-scale paired datasets of scene sketches do not. Generating a complex scene sketch containing multiple objects is harder than generating a single-object sketch.
- **Proposed Method**: The paper proposes a self-supervised method for generating scene sketches. It does not rely on existing scene sketch datasets; instead, it takes the semantic information in single-object sketches and enriches it through semantic expansion to produce scene sketches. Because it integrates multimodal perceptual constraints across text, sketches, and images, the method extends directly to image-to-sketch generation.
- **Technical Strategies**: The paper proposes three core technical strategies:
  - A GCN-based vector sketch captioning method that extracts basic semantic elements from vector sketches and generates scene descriptions through semantic expansion.
  - A text-driven canvas layout adjustment method that adjusts the layout of single-object sketches according to the expanded semantic information.
  - A scene sketch generation method based on multiple constraints, combining semantic fusion perception, sketch object content perception, and multi-object perception constraints.
- **Contribution**: The research contributes a large-scale "text-sketch-image" triplet dataset with scene sketches as its core component and high semantic consistency across the three modalities. The dataset addresses the lack of large-scale paired scene sketch data, and the paper reports that retraining existing models on it significantly improves their performance in sketch-based image retrieval and sketch-controlled image synthesis.
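To make the canvas layout adjustment step more concrete, the following is a minimal sketch of one way to place single-object vector sketches onto a shared canvas. It is not the paper's implementation. It assumes each sketch is a list of strokes (lists of `(x, y)` points) and that a target bounding box per object is already available. In the paper the layout is derived from the expanded text description, whereas here the boxes are simply given by hand. The names `bounding_box`, `place_sketch`, and `compose_scene` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_box(strokes: List[Stroke]) -> Box:
    """Return the axis-aligned bounding box of all points in the strokes."""
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    return min(xs), min(ys), max(xs), max(ys)


def place_sketch(strokes: List[Stroke], target: Box) -> List[Stroke]:
    """Uniformly scale and translate strokes so they fit, centered, in target."""
    x0, y0, x1, y1 = bounding_box(strokes)
    tx0, ty0, tx1, ty1 = target
    # Guard against zero-size sketches (a single point or a straight line).
    width = max(x1 - x0, 1e-9)
    height = max(y1 - y0, 1e-9)
    # One scale factor for both axes keeps the object's aspect ratio.
    scale = min((tx1 - tx0) / width, (ty1 - ty0) / height)
    # Offsets that center the scaled sketch inside the target box.
    off_x = tx0 + ((tx1 - tx0) - (x1 - x0) * scale) / 2
    off_y = ty0 + ((ty1 - ty0) - (y1 - y0) * scale) / 2
    return [
        [((x - x0) * scale + off_x, (y - y0) * scale + off_y) for x, y in stroke]
        for stroke in strokes
    ]


def compose_scene(objects: Dict[str, List[Stroke]],
                  layout: Dict[str, Box]) -> List[Stroke]:
    """Place every object into its assigned box and merge the strokes."""
    scene: List[Stroke] = []
    for name, strokes in objects.items():
        scene.extend(place_sketch(strokes, layout[name]))
    return scene


# Hypothetical example: a square "house" and a short "tree" polyline.
objects = {
    "house": [[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]],
    "tree": [[(0, 0), (0, 4)], [(-1, 4), (1, 4)]],
}
layout = {"house": (100, 100, 200, 150), "tree": (220, 90, 260, 150)}
scene = compose_scene(objects, layout)
```

Uniform scaling is a deliberate simplification: it keeps each object recognizable, at the cost of leaving part of a target box empty when the aspect ratios differ.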