Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Hengyu Liu, Zhiming Ding, Yanbing Jiang, Shi Han, Dongmei Zhang

DOI: https://doi.org/10.48550/arXiv.2305.16344

2024-03-07

Abstract: Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, containing textual and tabular data, remains underexplored. In this research, we specialize in harnessing the potential of LLMs to comprehend critical information from financial reports, which are hybrid long-documents. We propose an Automated Financial Information Extraction (AFIE) framework that enhances LLMs' ability to comprehend and extract information from financial reports. To evaluate AFIE, we develop a Financial Reports Numerical Extraction (FINE) dataset and conduct an extensive experimental analysis. Our framework is effectively validated on GPT-3.5 and GPT-4, yielding average accuracy increases of 53.94% and 33.77%, respectively, compared to a naive method. These results suggest that the AFIE framework offers accuracy for automated numerical extraction from complex, hybrid documents.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to efficiently extract information from Hybrid Long Documents (HLDs). Specifically, it examines the capabilities of large language models (LLMs) on hybrid documents that contain both text and tabular data, which usually exceed the token limits of LLMs. The authors propose a Simple Segmentation-Recombination Framework (SiReF) that enables LLMs to process HLDs, and conduct experimental analyses exploring four aspects:

1. **Methods for effectively selecting and summarizing useful parts of an HLD**: Different summarization strategies, such as Refine and Map-Reduce, are studied, along with the impact of the number of retrieved paragraphs on the results.
2. **Tabular serialization format**: Four tabular serialization formats (PLAIN, CSV, XML, and HTML) are compared, and a simplified format is found to be sufficient for LLMs to understand tabular information.
3. **Adaptability of SiReF**: Experiments along three dimensions verify the adaptability of SiReF in different scenarios: cross-domain applications, vaguely expressed questions, and LLMs with different capabilities.
4. **Prompt engineering**: The paper explores how prompt engineering can improve LLM effectiveness in HLD information extraction, including numerical precision enhancement, keyword completion, and few-shot learning.

To support future research, the paper also proposes a dataset named Financial Reports Numerical Extraction (FINE) for evaluating the numerical extraction capabilities of LLMs on financial reports.
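To make the tabular serialization comparison more concrete, the following is a minimal Python sketch of what serializing one small table into the four named formats could look like. The exact layouts used in the paper are not given in this summary, so the `" | "` separator for PLAIN, the `<row>`/`<cell>` element names for XML, the helper names, and the table values are all illustrative assumptions rather than the authors' implementation.

```python
import csv
import io
from html import escape

# A table is a list of rows; the first row is assumed to be the header.
Table = list[list[str]]


def to_plain(table: Table) -> str:
    # Assumed "PLAIN" layout: cells joined by " | ", one row per line.
    return "\n".join(" | ".join(row) for row in table)


def to_csv(table: Table) -> str:
    # The csv module quotes cells that contain commas (e.g. "1,200").
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue().rstrip("\n")


def to_html(table: Table) -> str:
    header, *body = table
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def to_xml(table: Table) -> str:
    # Assumed XML layout: one <row> per data row, each cell tagged with its column name.
    header, *body = table
    lines = ["<table>"]
    for row in body:
        lines.append("  <row>")
        for name, cell in zip(header, row):
            lines.append(f'    <cell column="{escape(name)}">{escape(cell)}</cell>')
        lines.append("  </row>")
    lines.append("</table>")
    return "\n".join(lines)


# Made-up values, for illustration only.
example: Table = [
    ["Item", "2022", "2021"],
    ["Total revenue", "1,200", "1,050"],
]

for name, serialize in [("PLAIN", to_plain), ("CSV", to_csv), ("XML", to_xml), ("HTML", to_html)]:
    text = serialize(example)
    print(f"--- {name} ({len(text)} characters) ---")
    print(text)
```

Printing the character counts gives a rough sense of how much more of the context window the markup-heavy formats consume for the same table, which is one reason a simplified format can be attractive when documents already strain the token limit.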
