Abstract: Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for ToC extraction, consisting of three steps: (1) Constructing an initial tree of text blocks based on reading order and font sizes; (2) Modelling each tree node (or text block) independently by considering its contextual information captured in a node-centric subtree; (3) Modifying the original tree by taking an appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both local and long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline with a fraction of the running time. Our framework proves its scalability by effectively handling documents of any length.
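
Step (1) of the abstract, building an initial tree from reading order and font sizes, can be illustrated with a minimal sketch. The attachment rule used here (each block becomes a child of the nearest preceding block with a strictly larger font size) is an assumption made for illustration, since the abstract does not state the exact rule. The names `Node`, `build_initial_tree`, and `outline` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    text: str
    font_size: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)


def build_initial_tree(blocks: List[Tuple[str, float]]) -> Node:
    """Build a tree from (text, font_size) pairs given in reading order.

    Assumed rule (not taken from the paper): a block is attached to the
    nearest preceding block whose font size is strictly larger.
    """
    root = Node("ROOT", float("inf"))  # sentinel: larger than any real font
    stack = [root]                     # path from the root to the last block
    for text, size in blocks:
        # Drop candidates that are not strictly larger than this block.
        while stack[-1].font_size <= size:
            stack.pop()
        node = Node(text, size, parent=stack[-1])
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree below `node` as indented lines, one per block."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


blocks = [
    ("Annual Report", 24.0),
    ("Environment", 18.0),
    ("Emissions fell this year.", 10.0),
    ("Governance", 18.0),
    ("Board", 14.0),
]
print("\n".join(outline(build_initial_tree(blocks))))
```

Body text such as "Emissions fell this year." still appears in this initial tree; removing such non-heading nodes is what the later modification step is for.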

What problem does this paper attempt to address?

The paper addresses the automatic extraction of the table of contents (ToC) from complex corporate Environmental, Social, and Governance (ESG) annual reports. ESG reports are often structurally diverse, lengthy, and rich in visual elements, which makes it difficult to apply traditional document understanding methods directly. To tackle this challenge, the authors make the following contributions:

1. **New dataset (ESGDoc)**: A collection of 1,093 public ESG annual reports from 563 companies, ranging from 2001 to 2022, built specifically for the ToC extraction task.
2. **Novel framework**: The framework consists of three steps:
   - Building an initial tree of text blocks based on reading order and font size;
   - Modelling each tree node independently, using the context captured in the subtree centred on that node;
   - Modifying the original tree by taking an action (keep, delete, or move) on each tree node to refine the ToC structure.
3. **Improved document segmentation and modelling**: Implemented with graph neural networks (GNNs), the approach preserves local and long-distance information within each segment and can handle documents of any length.
4. **Experimental results**: On the ESGDoc dataset, the proposed CMM (Construction-Modelling-Modification) framework outperforms the previous state-of-the-art baseline, MTD, in both accuracy and efficiency of ToC extraction, with a shorter running time and the ability to process long, complexly structured documents.

In summary, the paper aims to overcome the limitations of existing ToC extraction methods in handling complex documents such as ESG reports by constructing a specialized dataset and developing an innovative multi-step framework, thereby improving the accuracy and efficiency of ToC extraction.
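
The modification step (keep, delete, or move) can be sketched as below. The summary does not define the actions precisely, so the semantics here are assumptions: "delete" removes a node and promotes its children to its parent, and "move" re-attaches a node as the next sibling of its parent. The per-node actions would come from the trained model; in this sketch they are supplied as a hypothetical dictionary keyed by block text, which only works when texts are unique.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def delete(node: Node) -> None:
    """Assumed semantics: drop the node, splice its children into its parent."""
    parent = node.parent
    i = parent.children.index(node)
    for child in node.children:
        child.parent = parent
    parent.children[i:i + 1] = node.children
    node.parent, node.children = None, []


def move_up(node: Node) -> None:
    """Assumed semantics: make the node the next sibling of its parent."""
    parent = node.parent
    grand = parent.parent
    if grand is None:  # parent is the root, so the node cannot go higher
        return
    parent.children.remove(node)
    grand.children.insert(grand.children.index(parent) + 1, node)
    node.parent = grand


def walk(node: Node) -> Iterator[Node]:
    """Yield all descendants of `node` in pre-order."""
    for child in node.children:
        yield child
        yield from walk(child)


def apply_actions(root: Node, actions: Dict[str, str]) -> Node:
    # Snapshot the nodes first so that edits do not disturb the traversal.
    for node in list(walk(root)):
        action = actions.get(node.text, "keep")
        if action == "delete":
            delete(node)
        elif action == "move":
            move_up(node)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.text)
        lines.extend(outline(child, depth + 1))
    return lines


root = Node("ROOT")
report = root.add(Node("Report"))
env = report.add(Node("Environment"))
env.add(Node("Page 12")).add(Node("Water use"))
env.add(Node("Governance"))

apply_actions(root, {"Page 12": "delete", "Governance": "move"})
print("\n".join(outline(root)))
```

Because each node receives its own action, the tree is corrected with one decision per node rather than one decision per pair of headings, which is the property the abstract highlights.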

A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Advanced Unstructured Data Processing for ESG Reports: A Methodology for Structured Transformation and Enhanced Analysis

Multimodal Tree Decoder for Table of Contents Extraction in Document Images.

ESGReveal: An LLM-based approach for extracting structured data from ESG reports

TOC Structure Extraction from OCR-ed Books.

Hierarchical Logical Structure Extraction of Book Documents by Analyzing Tables of Contents

ESG-FTSE: A corpus of news articles with ESG relevance labels and use cases

Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs

CED: Catalog Extraction from Documents

READoc: A Unified Benchmark for Realistic Document Structured Extraction

Unfolding the Transitions in Sustainability Reporting

DSG: An End-to-End Document Structure Generator

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Automatic ESG Assessment of Companies by Mining and Evaluating Media Coverage Data: NLP Approach and Tool

Paradigm Shift in Sustainability Disclosure Analysis: Empowering Stakeholders with CHATREPORT, a Language Model-Based Tool

Modeling the Evolutionary Trends in Corporate ESG Reporting: A Study based on Knowledge Management Model

The future of document indexing: GPT and Donut revolutionize table of content processing

Glitter or gold? Deriving structured insights from sustainability reports via large language models

Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques

SCTc-TE: A Comprehensive Formulation and Benchmark for Temporal Event Forecasting