EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Haiyang Yu, Teng Fu, Bin Li, Xiangyang Xue

2024-07-24

Abstract: Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the inaccurate segmentation of text edges in scene text segmentation. Existing text segmentation methods usually introduce various kinds of text-related supervision to improve performance, but most of them ignore text edges, which are crucial for downstream tasks such as text erasing. The paper therefore proposes a new model, Edge-Aware Transformers (EAFormer), that aims to segment text more accurately, especially along text edges. EAFormer has three parts: a text edge extractor that detects edges and filters out the edges of non-text areas, an edge-guided encoder that makes the model pay more attention to text edges, and an MLP-based decoder that predicts text masks. Experiments on several commonly used benchmarks show that EAFormer outperforms previous methods, especially on the segmentation of text edges. In addition, because the annotations of some benchmarks (such as COCO_TS and MLT_S) are not accurate enough for a fair evaluation, the authors re-annotated these datasets, and they observe a larger performance improvement when the more accurate annotations are used for training.
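The first stage described above, detecting edges and then discarding those that fall outside text areas, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the gradient-threshold edge detector is a simple stand-in, the text-region mask is assumed to be supplied by some separate text detector, and the names `gradient_edges`, `filter_text_edges`, and `text_region` are hypothetical.

```python
import numpy as np


def gradient_edges(gray: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Binary edge map from gradient magnitude (stand-in edge detector)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak == 0:
        # Flat image: no edges anywhere.
        return np.zeros(gray.shape, dtype=bool)
    return magnitude / peak >= threshold


def filter_text_edges(edges: np.ndarray, text_region: np.ndarray) -> np.ndarray:
    """Keep only the edge pixels that lie inside the (assumed) text-region mask."""
    return np.logical_and(edges, text_region)


# Toy image: one bright block standing in for a text stroke,
# and one bright block standing in for a non-text object.
image = np.zeros((10, 16))
image[3:7, 2:5] = 1.0      # "text" stroke
image[3:7, 10:14] = 1.0    # non-text object

# Assumed output of a text detector: a coarse box around the text stroke.
text_region = np.zeros((10, 16), dtype=bool)
text_region[2:8, 1:6] = True

edges = gradient_edges(image)
text_edges = filter_text_edges(edges, text_region)

print("edge pixels before filtering:", int(edges.sum()))
print("edge pixels after filtering: ", int(text_edges.sum()))
```

In this toy case the edges of the non-text block are removed while the edges of the text stroke survive. How EAFormer's edge-guided encoder then consumes such a filtered edge map is not specified in this summary.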

Scene text extraction based on edges and support vector regression

SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression

A Direct Regression Scene Text Detector with Position-Sensitive Segmentation

TextFormer: Component-aware Text Segmentation with Transformer

TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

EAST: An Efficient and Accurate Scene Text Detector

Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions

Boundary-aware Arbitrary-shaped Scene Text Detector with Learnable Embedding Network

Aggregated Text Transformer for Scene Text Detection

FETNet: Feature Erasing and Transferring Network for Scene Text Removal

Weakly-Supervised Text Instance Segmentation

Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

Edit Probability for Scene Text Recognition

EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting

MOST: A Multi-Oriented Scene Text Detector with Localization Refinement

A Scene Text Detector for Text with Arbitrary Shapes

Multi-oriented Scene Text Detection via Corner Localization and Region Segmentation

An Improved Scene Text Extraction Method Using Conditional Random Field and Optical Character Recognition

Attention-based Feature Decomposition-Reconstruction Network for Scene Text Detection

Scene text removal via cascaded text stroke detection and erasing