Configurable Customized Information Extraction and Processing Pipeline
Seok Kim, Pierce Lai, Dariyan Khan, Kevin Zhao, Brian Le, Alex Luchianov, Margaret Yu, Patrick Wang
DOI: https://doi.org/10.1142/s0218001424590122
IF: 1.261
2024-08-24
International Journal of Pattern Recognition and Artificial Intelligence
Abstract: Extracting information from scanned business documents, while a necessary commercial task, is still done mostly by hand and requires significant human effort. Current solutions for automated document information extraction remain limited with regard to user-required customizability and the extraction of dataset-specific information, leaving the area a very active field of research. In this paper, we propose modifications and improvements to our previously developed custom pipeline for extracting and tabulating key-value pairs from commercial invoice documents. Our design changes and additions adapt the pipeline to a wider variety of document types and use cases, primarily through dataset-specific configuration files that promote customizability, along with new technical modules that address both general and dataset-specific complexities. We compare our pipeline's performance against current machine learning and commercial solutions on a real-world dataset, and demonstrate that it extracts a wider variety of fields while maintaining competitive or higher accuracy than the alternative solutions.
computer science, artificial intelligence