High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented Transformer

Mingxian Li, Hao Sun, Yingtie Lei, Xiaofeng Zhang, Yihang Dong, Yilin Zhou, Zimeng Li, Xuhang Chen
2024-10-30
Abstract: Document images are often degraded by various stains, significantly impacting their readability and hindering downstream applications such as document digitization and analysis. The absence of a comprehensive stained document dataset has limited the effectiveness of existing document enhancement methods in removing stains while preserving fine-grained details. To address this challenge, we construct StainDoc, the first large-scale, high-resolution ($2145\times2245$) dataset specifically designed for document stain removal. StainDoc comprises over 5,000 pairs of stained and clean document images across multiple scenes. This dataset encompasses a diverse range of stain types, severities, and document backgrounds, facilitating robust training and evaluation of document stain removal algorithms. Furthermore, we propose StainRestorer, a Transformer-based document stain removal approach. StainRestorer employs a memory-augmented Transformer architecture that captures hierarchical stain representations at part, instance, and semantic levels via the DocMemory module. The Stain Removal Transformer (SRTransformer) leverages these feature representations through a dual attention mechanism: an enhanced spatial attention with an expanded receptive field, and a channel attention that captures channel-wise feature importance. This combination enables precise stain removal while preserving document content integrity. Extensive experiments demonstrate StainRestorer's superior performance over state-of-the-art methods on the StainDoc dataset and its variants StainDoc\_Mark and StainDoc\_Seal, establishing a new benchmark for document stain removal. Our work highlights the potential of memory-augmented Transformers for this task and contributes a valuable dataset to advance future research.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses stains on document images, which reduce readability and hinder downstream applications such as document digitization and analysis. Existing document enhancement methods struggle to remove stains while preserving fine-grained details, mainly because no comprehensive, high-resolution dataset covering varied stain types has been available. To tackle this challenge, the authors construct the StainDoc dataset and propose StainRestorer, a memory-augmented Transformer-based document stain removal method.

### Main Issues

1. **Stain Problems in Document Images**:
   - Stains severely affect the readability and visual quality of documents, hindering research and applications such as Optical Character Recognition (OCR).
   - Traditional document enhancement methods often cannot handle fine-grained information precisely when dealing with stains, especially when stains overlap with text or image edges.
2. **Lack of High-Quality Datasets**:
   - The absence of a large, high-resolution dataset containing various types and severities of stains limits the effectiveness of existing document enhancement methods.

### Solutions

1. **Constructing the StainDoc Dataset**:
   - StainDoc is the first large, high-resolution dataset specifically designed for document stain removal. It contains over 5,000 pairs of stained and clean document images, covering various scenes, stain types, severities, and document backgrounds.
2. **Proposing the StainRestorer Model**:
   - StainRestorer is a memory-augmented Transformer-based document stain removal method that captures multi-level stain representations (part-level, instance-level, and semantic-level features) through the DocMemory module.
   - The Stain Removal Transformer (SRTransformer) uses these feature representations in a dual attention mechanism (enhanced spatial attention and channel attention) to remove stains precisely while preserving the integrity of the document content.

### Contributions

1. **Constructing the StainDoc Dataset**:
   - Provides a large, high-resolution dataset containing over 5,000 pairs of stained and clean document images, filling a gap in the document enhancement field.
2. **Proposing the DocMemory Module**:
   - Extracts and analyzes deep features of different granularities in documents through a series of Memory Units, capturing multi-level stain representations.
3. **Proposing the Stain Removal Transformer (SRTransformer)**:
   - Uses the multi-level stain representations in enhanced spatial attention and channel attention mechanisms to remove stains precisely while maintaining the integrity of the document content.

Together, these contributions provide both a data resource and a document stain removal method that, according to the authors' experiments, outperforms state-of-the-art methods on StainDoc and its variants StainDoc\_Mark and StainDoc\_Seal.
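
The channel attention half of the dual attention mechanism is described only as capturing channel-wise feature importance. As a rough illustration of that idea, the sketch below uses a generic squeeze-and-excitation style gate: average each channel, pass the averages through a small bottleneck, and scale each channel by the resulting weight. This is an assumption made for illustration, not the SRTransformer layer itself; the function name `channel_attention`, the weight matrices `w_down` and `w_up`, and the toy inputs are all hypothetical. It is written in plain Python so that it runs without dependencies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, w_down, w_up):
    """Reweight the channels of a C x H x W feature map given as nested lists.

    Hypothetical squeeze-and-excitation style gate, not the paper's exact layer.
    w_down is an R x C weight matrix and w_up is a C x R weight matrix,
    where R is the reduced (bottleneck) width.
    """
    # Squeeze: global average pool, one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excite: bottleneck projection with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_up]
    # Scale: multiply every value in channel c by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(features, gates)
    ]
    return scaled, gates


# Toy input: two 2 x 2 channels with made-up values and weights.
features = [
    [[1.0, 1.0], [1.0, 1.0]],  # channel 0, mean 1.0
    [[0.0, 0.0], [0.0, 0.0]],  # channel 1, mean 0.0
]
w_down = [[1.0, 0.0]]       # R = 1, C = 2
w_up = [[2.0], [-2.0]]      # C = 2, R = 1
scaled, gates = channel_attention(features, w_down, w_up)
```

With these toy weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first channel is emphasized. In a trained network the weights would be learned, and the gate would sit alongside a spatial attention branch rather than stand alone.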