Jingjing Wu, Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Fanglin Chen, Guangming Lu, Wenjie Pei
Abstract: Transcription-only Supervised Text Spotting aims to learn text spotters relying only on transcriptions but no text boundaries for supervision, thus eliminating expensive boundary annotation. The crux of this task lies in locating each transcription in scene text images without location annotations. In this work, we formulate this challenging problem as a Weakly Supervised Cross-modality Contrastive Learning problem, and design a simple yet effective model dubbed WeCromCL that is able to detect each transcription in a scene image in a weakly supervised manner. Unlike typical methods for cross-modality contrastive learning that focus on modeling the holistic semantic correlation between an entire image and a text description, our WeCromCL conducts atomistic contrastive learning to model the character-wise appearance consistency between a text transcription and its correlated region in a scene image to detect an anchor point for the transcription in a weakly supervised manner. The detected anchor points by WeCromCL are further used as pseudo location labels to guide the learning of text spotting. Extensive experiments on four challenging benchmarks demonstrate the superior performance of our model over other methods. Code will be released.
### What problems does this paper attempt to solve?
This paper addresses scene text spotting (detection and recognition) when the only supervision is the text transcription, with no text boundary annotations, that is, **text localization and recognition supervised only by text content**. Conventional methods rely on accurate text boundary annotations, which are costly and time-consuming to produce by hand. The paper therefore proposes **WeCromCL (Weakly Supervised Cross-Modality Contrastive Learning)**, which removes the dependence on boundary annotations and so lowers the annotation cost.
Specifically, this paper solves the following two key problems:
1. **How to accurately locate text instances without text boundary annotations**:
   - In a scene text image, the location of each text instance is unknown. Fully supervised methods need accurate bounding-box annotations to train the model, and these annotations are expensive. To address this, the authors decompose the task into two stages:
     - **Stage 1**: Use weakly supervised cross-modality contrastive learning to detect an anchor point for each text transcription. These anchor points serve as pseudo location labels.
     - **Stage 2**: Use the pseudo location labels from Stage 1 to train a single-point supervised text spotter, which performs text detection and recognition.
2. **How to effectively learn the character-level appearance consistency between a text transcription and its correlated image region**:
   - To achieve this, the authors design a simple yet effective model, WeCromCL, which learns this character-level appearance consistency through atomistic contrastive learning. Specifically, WeCromCL generates an activation map by computing the similarity between the text transcription and each pixel in the image, and takes the peak of the activation map as the anchor point.
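The activation-map step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the image has already been encoded into a per-pixel feature grid and the transcription into a single vector of the same dimension, and it uses cosine similarity as the similarity measure. The names `cosine`, `activation_map`, and `anchor_point`, as well as the toy data, are hypothetical; a real model would use learned encoders and a tensor library.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W grid of D-dimensional pixel features


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def activation_map(pixel_feats: FeatureMap, text_emb: Vector) -> List[List[float]]:
    """Similarity between one transcription embedding and every pixel feature."""
    return [[cosine(pixel, text_emb) for pixel in row] for row in pixel_feats]


def anchor_point(act: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) position of the peak of the activation map."""
    return max(
        ((r, c) for r, row in enumerate(act) for c in range(len(row))),
        key=lambda rc: act[rc[0]][rc[1]],
    )


# Toy data (assumption): random features stand in for encoder outputs.
random.seed(0)
H, W, D = 8, 10, 16
text = [random.gauss(0, 1) for _ in range(D)]
feats = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]

# Plant one pixel whose feature is aligned with the transcription embedding.
feats[3][7] = [5.0 * t for t in text]

act = activation_map(feats, text)
anchor = anchor_point(act)
print(anchor)  # (3, 7)
```

In this toy setup the planted pixel at row 3, column 7 has cosine similarity 1 with the transcription embedding, so it is selected as the anchor. In the two-stage framework, such an anchor would then serve as the pseudo location label for the single-point supervised text spotter.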
Together, these two stages detect and recognize text instances in a scene without relying on text boundary annotations, and the paper reports superior performance over other methods on four challenging benchmarks.
### Summary
The main contributions of this paper are:
- Proposing a two-stage framework that decomposes transcription-only supervised text spotting into weakly supervised text detection and single-point supervised text spotting.
- Designing WeCromCL, a simple yet effective model for weakly supervised atomistic cross-modality contrastive learning, which learns the character-level appearance consistency between a text transcription and its correlated image region.
- Experimental results show that this method outperforms existing methods on four challenging benchmark datasets.
With this method, the authors remove the dependence on expensive boundary annotations and lower the annotation cost while maintaining high detection and recognition accuracy.