Extracting Financial Data From Unstructured Sources: Leveraging Large Language Models

Huaxia Li, Haoyun Gao, Chengzhang Wu, Miklos A. Vasarhelyi
DOI: https://doi.org/10.2139/ssrn.4567607
2023-01-01
SSRN Electronic Journal
Abstract: This research addresses the challenge of extracting financial data from unstructured sources, a persistent issue for accounting researchers, investors, and regulators. Leveraging large language models (LLMs), this study develops a framework for automated financial data extraction from PDF-formatted files. Following the design science methodology, this research develops the framework through a series of text mining and prompt engineering techniques and further applies it to governmental annual reports in PDF format. Pilot test results indicate that the framework achieves a 100% accuracy rate within a short period of time when extracting key financial indicators. This study contributes to the evolving literature on applying LLMs in accounting and finance, while also providing a practical tool for both academic and industry applications.
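The abstract describes the framework only at a high level (text mining followed by prompt engineering), so the following is an illustrative sketch of what such a pipeline could look like, not the authors' implementation. It assumes the PDF text has already been extracted into one string per page, that a keyword filter stands in for the text mining step, and that the LLM is asked to reply in JSON. All function names, the prompt wording, and the `llm` callable are hypothetical.

```python
import json
import re
from typing import Callable, Dict, List, Optional


def select_relevant_pages(pages: List[str], keywords: List[str]) -> List[str]:
    """Keep only pages that mention at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [p for p in pages if any(k in p.lower() for k in lowered)]


def build_prompt(page_text: str, indicators: List[str]) -> str:
    """Assemble a prompt asking for the indicators as a JSON object (wording is an assumption)."""
    names = ", ".join(indicators)
    return (
        "Extract the following financial indicators from the report text below: "
        f"{names}.\n"
        "Reply with a single JSON object mapping each indicator name to a number, "
        "or to null if the value is not stated.\n\n"
        f"Report text:\n{page_text}"
    )


def parse_response(raw: str, indicators: List[str]) -> Dict[str, Optional[float]]:
    """Pull the JSON object out of the model reply; non-numeric or missing values become None."""
    empty: Dict[str, Optional[float]] = {name: None for name in indicators}
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    result: Dict[str, Optional[float]] = {}
    for name in indicators:
        value = data.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        result[name] = float(value) if is_number else None
    return result


def extract_indicators(
    pages: List[str],
    indicators: List[str],
    llm: Callable[[str], str],  # hypothetical: any function mapping a prompt to a reply
) -> Dict[str, Optional[float]]:
    """Query the model page by page and keep the first non-null value per indicator."""
    merged: Dict[str, Optional[float]] = {name: None for name in indicators}
    for page in select_relevant_pages(pages, indicators):
        reply = llm(build_prompt(page, indicators))
        for name, value in parse_response(reply, indicators).items():
            if merged[name] is None and value is not None:
                merged[name] = value
    return merged
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider. In practice the page strings would come from a PDF text extraction library, and extracted values would still need to be checked against the source reports.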