Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Hengyu Liu, Zhiming Ding, Yanbing Jiang, Shi Han, Dongmei Zhang
DOI: https://doi.org/10.48550/arXiv.2305.16344
2024-03-07
Abstract: Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, containing textual and tabular data, remains underexplored. In this research, we specialize in harnessing the potential of LLMs to comprehend critical information from financial reports, which are hybrid long-documents. We propose an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. To evaluate AFIE, we develop a Financial Reports Numerical Extraction (FINE) dataset and conduct an extensive experimental analysis. Our framework is effectively validated on GPT-3.5 and GPT-4, yielding average accuracy increases of 53.94% and 33.77%, respectively, compared to a naive method. These results suggest that the AFIE framework offers accuracy for automated numerical extraction from complex, hybrid documents.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to extract information efficiently from Hybrid Long Documents (HLDs). Specifically, it examines how well large language models (LLMs) handle hybrid documents that contain both text and tabular data, given that such documents usually exceed LLM token limits. The authors therefore propose a method based on the Simple Segmentation-Recombination Framework (SiReF) to enable LLMs to process HLDs, and conduct experimental analyses of four aspects:

1. **Selecting and summarizing the useful parts of an HLD**: The paper studies different summarization strategies, such as Refine and Map-Reduce, and the impact of the number of retrieved paragraphs on the results.
2. **Tabular serialization format**: Four tabular serialization formats (PLAIN, CSV, XML, and HTML) are compared, and the simplified format is found to be sufficient for LLMs to understand tabular information.
3. **Adaptability of SiReF**: Experiments along three dimensions verify the adaptability of SiReF in different scenarios: cross-domain applications, vaguely expressed problems, and LLMs with different capabilities.
4. **Prompt engineering**: The paper explores how prompt engineering can improve LLM effectiveness in HLD information extraction, including numerical precision enhancement, keyword completion, and few-shot learning.

To support future research, the paper also introduces a dataset, Financial Reports Numerical Extraction (FINE), for evaluating the numerical extraction capabilities of LLMs on financial reports.
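To make the tabular serialization comparison more concrete, the sketch below renders one small table in four styles that correspond to the format names PLAIN, CSV, XML, and HTML. This summary does not give the exact layouts used in the paper, so the function names, the ` | ` separator for the plain style, and the XML tag names are all assumptions chosen for illustration, and the sample figures are made up.

```python
import csv
import html
import io
from xml.sax.saxutils import escape as xml_escape


def to_plain(header, rows):
    # Assumed "plain" layout: cells joined by " | ", one table row per line.
    return "\n".join(" | ".join(str(c) for c in r) for r in [header, *rows])


def to_csv(header, rows):
    # Standard CSV; cells that contain commas are quoted by the csv module.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().rstrip("\n")


def to_xml(header, rows):
    # Hypothetical tag names (<table>, <header>, <row>, <cell>).
    def cells(values):
        return "".join(f"<cell>{xml_escape(str(v))}</cell>" for v in values)

    lines = ["<table>", f"<header>{cells(header)}</header>"]
    lines += [f"<row>{cells(r)}</row>" for r in rows]
    lines.append("</table>")
    return "\n".join(lines)


def to_html(header, rows):
    # Minimal HTML table: <th> for header cells, <td> for data cells.
    head = "".join(f"<th>{html.escape(str(h))}</th>" for h in header)
    lines = ["<table>", f"<tr>{head}</tr>"]
    for r in rows:
        body = "".join(f"<td>{html.escape(str(c))}</td>" for c in r)
        lines.append(f"<tr>{body}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Made-up example data, not taken from the FINE dataset.
header = ["Item", "2022", "2021"]
rows = [["Total assets", "1,234", "1,100"], ["R&D expense", "56", "48"]]

serialized = {
    "PLAIN": to_plain(header, rows),
    "CSV": to_csv(header, rows),
    "XML": to_xml(header, rows),
    "HTML": to_html(header, rows),
}

for name, text in serialized.items():
    print(f"--- {name} ({len(text)} characters) ---")
    print(text)
```

Printing the character counts side by side shows how much markup each style adds on top of the same cell contents, which matters when a document already strains the model's token limit.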