HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures

Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui Zhu, Cong Liu
2023-03-24
Abstract: The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works mainly focus on splitting the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields. To better evaluate the system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations including categories and relations obtained from rule-based extractors and human annotators. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of multi-page document structure reconstruction. Specifically, it covers the following aspects:

1. **Defined a new task**: The hierarchical reconstruction of document structures (HRDS), which aims to convert digital or scanned documents into their corresponding semantic structures.
2. **Constructed a new dataset**: HRDoc contains 2,500 multi-page documents with nearly 2 million semantic units. Every document is annotated at the line level with category and relation information.
3. **Proposed a baseline method**: A hierarchical document structure parsing system (DSPS) based on an encoder-decoder architecture. The system combines a multimodal bidirectional encoder with a structure-aware GRU decoder that uses a soft-mask operation, and it surpasses the baseline method by a large margin.
4. **Contribution overview**:
   - Introduced hierarchical document structure reconstruction as a new vision-and-language task.
   - Constructed a new dataset, HRDoc, focused on fine-grained, document-level structure reconstruction of multi-page documents.
   - Proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) that achieves significant improvements over the baseline method through a multimodal bidirectional encoder and a structure-aware GRU decoder.

The goal of the paper is to better understand and process the semantic structures of multi-page documents in natural language processing (NLP) and computer vision (CV), a problem that existing research, which concentrates on element boundaries within a single page, has largely overlooked. By providing a new dataset and a baseline method, the work offers resources and reference points for future research.
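
To make the idea of hierarchical reconstruction concrete, the following is a minimal sketch of how line-level annotations (a category and a relation per semantic unit) could be assembled into a document tree. It is not the DSPS model and does not reflect the actual HRDoc file format: the `Unit` class, its field names, the `parent` index encoding, and the sample categories are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    """One line-level semantic unit. All field names are hypothetical."""
    text: str
    category: str            # e.g. "title", "section", "paragraph" (assumed labels)
    parent: Optional[int]    # index of the parent unit, or None for a top-level unit
    children: List["Unit"] = field(default_factory=list)


def build_tree(units: List[Unit]) -> Unit:
    """Attach every unit to its parent and return a synthetic root node."""
    root = Unit(text="", category="root", parent=None)
    for unit in units:
        if unit.parent is None:
            root.children.append(unit)
        else:
            units[unit.parent].children.append(unit)
    return root


def outline(node: Unit, depth: int = 0) -> List[str]:
    """Render the tree below `node` as indented lines, in reading order."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + f"[{child.category}] {child.text}")
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Toy annotations; a real document would span multiple pages.
    units = [
        Unit("HRDoc", "title", None),
        Unit("1 Introduction", "section", 0),
        Unit("Documents are ...", "paragraph", 1),
        Unit("2 Method", "section", 0),
    ]
    print("\n".join(outline(build_tree(units))))
```

In this sketch the parent links are given; in the task the paper describes, predicting each unit's category and its relation to other units is the part a model would have to learn, and the tree assembly shown here would only be the final step.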