A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports

Xinyu Wang, Lin Gui, Yulan He
2023-10-27
Abstract: Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for ToC extraction, consisting of three steps: (1) Constructing an initial tree of text blocks based on reading order and font sizes; (2) Modelling each tree node (or text block) independently by considering its contextual information captured in a node-centric subtree; (3) Modifying the original tree by taking appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both local and long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline with a fraction of the running time. Our framework proves its scalability by effectively handling documents of any length.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the automatic extraction of the Table of Contents (ToC) from complex corporate Environmental, Social, and Governance (ESG) annual reports. ESG reports are often structurally diverse, lengthy, and rich in visual elements, which makes traditional document understanding methods difficult to apply directly. To tackle this challenge, the authors make the following contributions:

1. **New Dataset ESGDoc**: A collection of 1,093 public ESG annual reports from 563 companies, ranging from 2001 to 2022, built specifically for the ToC extraction task.
2. **A Novel Framework**: The framework consists of three steps:
   - Building an initial tree of text blocks based on reading order and font size;
   - Independently modeling each tree node, using the context captured within the subtree centered on that node;
   - Modifying the original tree by taking an appropriate action (Keep, Delete, or Move) on each tree node to refine the ToC structure.
3. **Improved Document Segmentation and Modeling Approach**: Implemented with Graph Neural Networks (GNN), the approach preserves local and long-distance information within each segment, which allows it to handle documents of any length.
4. **Experimental Results**: On the ESGDoc dataset, the proposed CMM (Construction-Modelling-Modification) framework outperforms the previous state-of-the-art baseline method, MTD, in runtime and in the ability to process long documents. CMM also performs well in both the accuracy and the efficiency of ToC extraction, which supports its suitability for documents with complex structures and extensive length.
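The construction and modification steps can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Node`, `build_initial_tree`, `apply_action`, and `outline` are hypothetical, and the attachment rule (each block becomes a child of the nearest preceding block with a strictly larger font size) is an assumption about how reading order and font size might be combined. The exact tree semantics of Delete and Move are likewise assumed here, and the action is supplied by hand, whereas the paper predicts it from a model of each node's subtree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # identity-based equality, safe with parent/child cycles
class Node:
    text: str
    font_size: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def build_initial_tree(blocks: List[Tuple[str, float]]) -> Node:
    """Build a tree from (text, font_size) blocks given in reading order.

    Assumed rule: a block attaches to the nearest preceding block whose
    font size is strictly larger than its own.
    """
    root = Node("ROOT", float("inf"))  # infinite size, so it is never popped
    stack = [root]
    for text, size in blocks:
        while stack[-1].font_size <= size:
            stack.pop()
        node = Node(text, size, parent=stack[-1])
        stack[-1].children.append(node)
        stack.append(node)
    return root


def apply_action(node: Node, action: str) -> None:
    """Apply one of 'keep', 'delete', 'move' to a non-root node (assumed semantics)."""
    parent = node.parent
    if action == "keep" or parent is None:
        return
    idx = parent.children.index(node)
    if action == "delete":
        # Drop the node and splice its children into its former position.
        for child in node.children:
            child.parent = parent
        parent.children[idx:idx + 1] = node.children
    elif action == "move":
        # Promote the node so it becomes the next sibling of its former parent.
        grand = parent.parent
        if grand is None:
            return  # already at the top level
        parent.children.pop(idx)
        grand.children.insert(grand.children.index(parent) + 1, node)
        node.parent = grand
    else:
        raise ValueError(f"unknown action: {action}")


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree below `node` as indented lines."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


# Toy input: text blocks in reading order with made-up font sizes.
blocks = [
    ("Annual Report", 24.0),
    ("Environment", 18.0),
    ("Page 3", 8.0),
    ("Emissions", 14.0),
    ("Governance", 18.0),
]
root = build_initial_tree(blocks)
page_marker = root.children[0].children[0].children[0]  # the "Page 3" block
apply_action(page_marker, "delete")  # action chosen by hand, not by a model
print("\n".join(outline(root)))
```

In this toy example, the page marker initially sits under "Environment" because of its small font, and deleting it leaves a three-level outline. Each action touches only one node and its immediate neighbours, so no heading pair needs to be compared, which is the property the summary credits for making long documents tractable.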
In summary, the paper aims to overcome the limitations of existing ToC extraction methods in handling complex documents such as ESG reports by constructing a specialized dataset and developing an innovative multi-step framework, thereby improving the accuracy and efficiency of ToC extraction.