Abstract: Multimodal key information extraction (KIE) models have been studied extensively on semi-structured documents. However, their investigation on unstructured documents is an emerging research topic. The paper presents an approach to adapt a multimodal transformer (i.e., ViBERTgrid, previously explored on semi-structured documents) for unstructured financial documents by incorporating a BiLSTM-CRF layer. The proposed ViBERTgrid BiLSTM-CRF model demonstrates a significant improvement in performance (up to 2 percentage points) on named entity recognition from unstructured documents in the financial domain, while maintaining its KIE performance on semi-structured documents. As an additional contribution, we publicly released token-level annotations for the SROIE dataset in order to pave the way for its use in multimodal sequence labeling models.

What problem does this paper attempt to address?

This paper addresses the challenge of extracting key information from unstructured financial documents. Specifically, it proposes a new model named ViBERTgrid BiLSTM-CRF, aiming to improve the performance of multimodal key information extraction (KIE) on unstructured documents. The main problems and objectives of this research are as follows:

### 1. Research Background

- **Existing Research**: Multimodal key information extraction models have been widely studied on semi-structured documents, but research on unstructured documents is still emerging.
- **Challenge**: Unstructured documents lack a predefined structure, so extracting key information from them requires deeper language understanding.

### 2. Research Objectives

- **Improve Performance**: Combine ViBERTgrid (a multimodal transformer) with BiLSTM-CRF (a sequence-labeling model) to improve named entity recognition (NER) on unstructured financial documents.
- **Verify Universality**: Ensure that the proposed architecture not only performs well on unstructured documents but also maintains its performance on semi-structured documents.
- **Dataset Contribution**: Provide public token-level annotations for the SROIE dataset to support research on multimodal sequence-labeling models.

### 3. Method Innovation

- **ViBERTgrid BiLSTM-CRF**: Combines the visual and textual feature extraction of ViBERTgrid with the syntactic and long-range context awareness of BiLSTM-CRF.
- **Experimental Design**: Uses two datasets, SROIE (semi-structured receipt dataset) and UTD/UMTD (unstructured transfer order dataset), and verifies the effectiveness of the model with multiple evaluation metrics.

### 4. Main Contributions

- **Performance Improvement**: Performance improves by up to 2 percentage points on named entity recognition from unstructured financial documents.
- **Universality Verification**: Demonstrates the universality and effectiveness of the proposed architecture on both semi-structured and unstructured documents.
- **Dataset Release**: Releases token-level annotations for the SROIE dataset, supporting further work on multimodal information extraction.

In summary, this paper addresses how to extract key information from unstructured financial documents efficiently and accurately. By introducing a new model architecture and dataset annotations, it provides a basis for further research in this field.
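To make the role of the CRF layer more concrete, here is a minimal sketch of Viterbi decoding for a linear-chain CRF in plain Python. It is not the paper's implementation. The function name `viterbi_decode`, the label set, and all scores are hypothetical and chosen only for illustration. The emission scores stand in for what a BiLSTM might output per token, and the transition scores stand in for what a CRF might learn.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain CRF.

    emissions[t][y] is the per-token score for label y at position t.
    transitions[(p, y)] is the score for moving from label p to label y
    (missing pairs default to 0.0).
    """
    if not emissions:
        return []

    # best[t][y]: best score of any path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y]: best previous label for y at position t

    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical BIO labels for an amount entity and made-up scores.
labels = ["O", "B-AMT", "I-AMT"]
emissions = [
    {"O": 2.0, "B-AMT": 0.5, "I-AMT": 0.0},
    {"O": 0.1, "B-AMT": 1.0, "I-AMT": 1.2},
    {"O": 0.2, "B-AMT": 0.1, "I-AMT": 1.5},
]
# Penalize the invalid BIO transition O -> I-AMT.
transitions = {("O", "I-AMT"): -10.0}

print(viterbi_decode(emissions, transitions, labels))
```

With these illustrative scores, picking the best label per token independently would give `O, I-AMT, I-AMT`, which is not a valid BIO sequence. The transition penalty steers the decoder toward `O, B-AMT, I-AMT` instead, which is the kind of sequence-level consistency a CRF layer is meant to add on top of per-token scores.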

ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

FETILDA: An Evaluation Framework for Effective Representations of Long Financial Documents

XLM-RoBERTa Model for Key Information Extraction on Military Document

FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents

Fusion of visual representations for multimodal information extraction from unstructured transactional documents

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

Enabling and Analyzing How to Efficiently Extract Information from Hybrid Long Documents with LLMs

Performance Evaluation of Word and Sentence Embeddings for Finance Headlines Sentiment Analysis

VKIE: The Application of Key Information Extraction on Video Text

Key Information Extraction From Documents: Evaluation And Generator

GenKIE: Robust Generative Multimodal Document Key Information Extraction

MatViX: Multimodal Information Extraction from Visually Rich Articles

IPerFEX-2023: Indonesian personal financial entity extraction using indoBERT-BiGRU-CRF model

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

FiNER: Financial Numeric Entity Recognition for XBRL Tagging

GraphRevisedIE: Multimodal Information Extraction with Graph-Revised Network

FinBERT–MRC: Financial Named Entity Recognition Using BERT Under the Machine Reading Comprehension Paradigm

Application of BiLSTM-CRF model with different embeddings for product name extraction in unstructured Turkish text

Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages

MFF-CNER: A Multi-feature Fusion Model for Chinese Named Entity Recognition in Finance Securities