SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation

Zhenbei Wu, Qiang Wang, Jie Yang

2024-05-29

Abstract: The scarcity of free-hand sketches presents a challenging problem. Despite the emergence of some large-scale sketch datasets, these datasets primarily consist of sketches at the single-object level. There continues to be a lack of large-scale paired datasets for scene sketches. In this paper, we propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketches and transforms single-object sketches into scene sketches. To accomplish this, we introduce a method for vector sketch captioning and sketch semantic expansion. Additionally, we design a sketch generation network that fuses multi-modal perceptual constraints and is suitable for zero-shot image-to-sketch downstream tasks, where it demonstrates state-of-the-art performance in experimental validation. Finally, leveraging our proposed sketch-to-sketch generation method, we contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent "text-sketch-image" triplets. Our research confirms that this dataset can significantly enhance the capabilities of existing models in sketch-based image retrieval and sketch-controlled image synthesis tasks. We will make our dataset and code publicly available.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the scarcity of freehand sketch data, and in particular the generation of scene-level sketches. Specifically:

- **Core Issue**: Several large-scale single-object sketch datasets exist, but large-scale paired scene sketch datasets do not. Generating complex scene sketches containing multiple objects is more challenging than generating single-object sketches.
- **Proposed Method**: The paper proposes a self-supervised method for generating scene sketches. The method does not rely on existing scene sketch datasets; instead, it uses the semantic information in single-object sketches and produces rich scene sketches through semantic expansion. Because it integrates multimodal perceptual constraints over text, sketches, and images, the method extends directly to image-to-sketch generation.
- **Technical Strategies**: The paper proposes three core technical strategies:
  - A GCN-based vector sketch captioning method that extracts basic semantic elements from vector sketches and generates scene descriptions through semantic expansion.
  - A text-driven canvas layout adjustment method that adjusts the layout of single-object sketches based on the expanded semantic information.
  - A scene sketch generation method based on multiple constraints, integrating semantic fusion perception, sketch object content perception, and multi-object perception constraints.
- **Contribution**: The research contributes a large-scale "text-sketch-image" triplet dataset with scene sketches as its core component and with high semantic consistency. The dataset addresses the lack of large-scale paired scene sketch data, and existing models retrained on it perform significantly better on sketch-based image retrieval and sketch-controlled image synthesis.
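To make the "GCN-based" part of the captioning strategy more concrete, the following is a minimal sketch of one generic graph-convolution step (in the Kipf and Welling style) applied to a toy stroke graph. It is not the paper's architecture: the summary gives no details of the network, so treating each stroke as a node, the `adj`, `feats`, and `weight` values, and the `gcn_layer` helper are all illustrative assumptions.

```python
import math


def gcn_layer(adj, feats, weight):
    """One generic graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(adj)
    # Add self-loops so each node keeps a share of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Symmetric normalization by node degree (self-loops included).
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate features from each node's neighborhood.
    in_dim = len(feats[0])
    agg = [[sum(norm[i][k] * feats[k][d] for k in range(n))
            for d in range(in_dim)] for i in range(n)]
    # Apply the linear map, then ReLU.
    out_dim = len(weight[0])
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]


# Assumed toy input: three strokes connected in a chain (0 - 1 - 2),
# each carrying a 2-dimensional feature vector.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 0.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]  # identity map, so the output shows pure aggregation

out = gcn_layer(adj, feats, weight)
```

After one step, the middle stroke's features mix in those of both neighbors, while the two end strokes, which have identical features and symmetric positions, receive identical outputs. In a captioning model, stacked layers of this kind would feed a decoder that emits the semantic elements; that part is omitted here because the summary does not describe it.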

