Abstract: Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.


### What problems does this paper attempt to solve?

This paper addresses scene text detection and recognition when only text transcriptions are available and no text boundary annotations are provided, that is, **text localization and recognition supervised only by text content**. Traditional methods rely on accurate text boundary annotations, which require a large amount of manual labeling and are costly and time-consuming. The paper therefore proposes **WeCromCL (Weakly Supervised Cross-Modality Contrastive Learning)** to remove the dependence on boundary annotations and thus reduce the annotation cost.

Specifically, the paper tackles two key problems:

1. **How to accurately locate text instances without text boundary annotations**
   - In a scene text image, the location of each text instance is unknown. Fully supervised methods need accurate bounding box annotations to train the model, and these annotations are expensive. To solve this, the authors decompose the task into two stages:
     - **Stage 1**: Use weakly supervised cross-modality contrastive learning to detect the anchor point of each text transcription. These anchor points serve as pseudo location labels.
     - **Stage 2**: Use the pseudo location labels obtained in Stage 1 to train a single-point supervised text spotter, thereby achieving text detection and recognition.
2. **How to effectively learn the character-level appearance consistency between a text transcription and its related image region**
   - To achieve this, the authors design a simple and effective model, WeCromCL, which learns the character-level appearance consistency between a text transcription and its related image region through atomistic contrastive learning. Specifically, WeCromCL generates an activation map by calculating the similarity between the text transcription and each pixel in the image, and takes the peak position of the activation map as the anchor point.

Through these two stages, WeCromCL can detect and recognize text instances in a scene without relying on text boundary annotations, and it achieves excellent performance on multiple benchmark datasets.

### Summary

The main contributions of this paper are:

- Proposing a two-stage framework that decomposes transcription-only supervised text spotting into weakly supervised anchor point detection and single-point supervised text spotting.
- Designing a simple and effective model, WeCromCL, for weakly supervised atomistic cross-modality contrastive learning, which learns the character-level appearance consistency between a text transcription and its related image region.
- Showing experimentally that the method outperforms existing methods on four challenging benchmark datasets.

With this method, the authors reduce the dependence on expensive boundary annotations and lower the annotation cost while maintaining high detection and recognition accuracy.
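
The anchor-point step described above (a similarity score between the transcription and every pixel, followed by picking the peak) can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine`, `activation_map`, and `anchor_point` are hypothetical, cosine similarity is an assumed choice of similarity measure, and the per-pixel features and transcription embedding are random stand-ins for the outputs of image and text encoders that the summary does not specify.

```python
import math
import random


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def activation_map(pixel_feats, text_emb):
    """Score every pixel feature against one transcription embedding.

    pixel_feats: H x W grid of D-dimensional feature vectors (hypothetical encoder output).
    text_emb:    D-dimensional embedding of one transcription (hypothetical encoder output).
    Returns an H x W grid of similarity scores.
    """
    return [[cosine(feat, text_emb) for feat in row] for row in pixel_feats]


def anchor_point(act_map):
    """Return the (row, col) position of the highest score in the activation map."""
    positions = ((r, c) for r, row in enumerate(act_map) for c in range(len(row)))
    return max(positions, key=lambda rc: act_map[rc[0]][rc[1]])


# Toy data: random features, with one pixel made to match the transcription.
random.seed(0)
H, W, D = 4, 6, 8
text_emb = [random.gauss(0, 1) for _ in range(D)]
pixel_feats = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
pixel_feats[2][3] = [2.0 * x for x in text_emb]  # same direction as the text embedding

act = activation_map(pixel_feats, text_emb)
anchor = anchor_point(act)
print(anchor)
```

In a two-stage setup like the one summarized here, a point such as `anchor` would play the role of the pseudo location label passed to the single-point supervised text spotter. How the encoders are trained so that the peak lands on the right region is the contrastive learning part, which this sketch does not cover.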

WeCromCL: Weakly Supervised Cross-Modality Contrastive Learning for Transcription-only Supervised Text Spotting

Turning a CLIP Model into a Scene Text Spotter

Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Turning a CLIP Model into a Scene Text Detector

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

CLIM: Contrastive Language-Image Mosaic for Region Representation

Context-Based Contrastive Learning for Scene Text Recognition

CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification

CSS-LM: A Contrastive Framework for Semi-Supervised Fine-Tuning of Pre-Trained Language Models

WeText: Scene Text Detection under Weak Supervision

TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

MuSCLe: A Multi-Strategy Contrastive Learning Framework for Weakly Supervised Semantic Segmentation

Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition

Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding

AsCL: An Asymmetry-sensitive Contrastive Learning Method for Image-Text Retrieval with Cross-Modal Fusion

Multimodal contrastive learning for spatial gene expression prediction using histology images

RegionCL: Can Simple Region Swapping Contribute to Contrastive Learning?

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis