ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents

Furkan Pala, Mehmet Yasin Akpınar, Onur Deniz, Gülşen Eryiğit
2024-09-23
Abstract: Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid, previously explored on semi-structured documents) for unstructured financial documents by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in the financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
This paper addresses the challenge of extracting key information from unstructured financial documents. Specifically, it proposes a new model named ViBERTgrid BiLSTM-CRF, aiming to improve the performance of multimodal key information extraction (KIE) on unstructured documents. The main problems and objectives of this research are as follows:

### 1. Research Background

- **Existing Research**: Multimodal key information extraction models have been widely studied on semi-structured documents, but research on unstructured documents is still at an early stage.
- **Challenge**: Unstructured documents lack a predefined structure, so extracting key information from them requires deeper language understanding.

### 2. Research Objectives

- **Improve Performance**: Combine ViBERTgrid (a multimodal transformer) with BiLSTM-CRF (a sequence-labeling model) to improve named-entity recognition (NER) on unstructured financial documents.
- **Verify Universality**: Ensure that the proposed architecture not only performs well on unstructured documents but also maintains its performance on semi-structured documents.
- **Dataset Contribution**: Provide public token-level annotations for the SROIE dataset to support research on multimodal sequence-labeling models.

### 3. Method Innovation

- **ViBERTgrid BiLSTM-CRF**: Combines the visual and textual feature extraction capabilities of ViBERTgrid with the syntactic and long-term context-aware capabilities of BiLSTM-CRF.
- **Experimental Design**: Use two datasets, SROIE (a semi-structured receipt dataset) and UTD/UMTD (an unstructured transfer order dataset), and verify the effectiveness of the model with multiple evaluation metrics.

### 4. Main Contributions

- **Performance Improvement**: Performance on the named-entity recognition task for unstructured financial documents improves by up to 2 percentage points.
- **Universality Verification**: Demonstrate the universality and effectiveness of the proposed architecture on both semi-structured and unstructured documents.
- **Dataset Release**: Release token-level annotations for the SROIE dataset, supporting further development of multimodal information extraction.

In summary, this paper addresses the problem of how to efficiently and accurately extract key information from unstructured financial documents and, by introducing a new model architecture and dataset annotations, provides a basis for further research in this field.
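To make the role of the CRF layer in a BiLSTM-CRF head more concrete, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF in plain Python. It is not the authors' implementation: the tag set (`O`, `B-TOTAL`, `I-TOTAL`), the emission scores (standing in for per-token BiLSTM outputs), and the transition scores are all made-up assumptions chosen only for illustration. The point it shows is that a CRF scores whole tag sequences, so it can reject a locally attractive but invalid sequence that per-token argmax would pick.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[t][j]   : score of tag j at token t (e.g. produced by a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []                 # backptr[t-1][j] = best previous tag for tag j at t
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Follow the back-pointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptr in reversed(backptr):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical tag set and scores, for illustration only.
TAGS = ["O", "B-TOTAL", "I-TOTAL"]
transitions = [
    [0.0, 0.0, -10.0],  # from O: jumping straight to I-TOTAL is penalized
    [0.0, -1.0, 1.0],   # from B-TOTAL: continuing with I-TOTAL is favored
    [0.0, 0.0, 0.5],    # from I-TOTAL
]
emissions = [
    [2.0, 0.5, 0.1],    # token 0: clearly outside any entity
    [0.2, 1.0, 1.2],    # token 1: noisy, I-TOTAL slightly ahead of B-TOTAL
    [0.1, 0.3, 2.0],    # token 2: clearly inside the entity
]

# Per-token argmax ignores transitions and yields an invalid BIO sequence.
greedy = [max(range(len(TAGS)), key=lambda j: row[j]) for row in emissions]
decoded = viterbi_decode(emissions, transitions)

print([TAGS[i] for i in greedy])
print([TAGS[i] for i in decoded])
```

With these assumed scores, per-token argmax labels the second token `I-TOTAL` directly after `O`, which is not a valid BIO sequence, while the Viterbi path starts the entity with `B-TOTAL`. In a trained model the transition scores would be learned jointly with the encoder rather than set by hand.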