Abstract: Visually rich documents (e.g. leaflets, banners, magazine articles) are physical or digital documents that use visual cues to augment their semantics. The information contained in these documents is ad hoc and often incomplete. Existing work that enables structured querying on these documents does not take this into account, which makes it difficult to contextualize the information retrieved from them and to gather actionable insights. We propose Juno, a cross-modal entity matching framework, to address this limitation. It augments heterogeneous documents with supplementary information by matching a text span in the document with semantically similar tuples from an external database. Our main contribution is a deep neural network with attention that goes beyond traditional keyword-based matching: it finds matching tuples by aligning text spans and relational tuples in a multimodal encoding space, without any prior knowledge of the document type or the underlying schema. Exhaustive experiments on multiple real-world datasets show that Juno generalizes to heterogeneous documents with diverse layouts and formats. It outperforms state-of-the-art baselines by more than 6 F1 points with up to 60% fewer human-labeled samples. Our experiments further show that Juno is a computationally robust framework. It can be trained only once and then adapted dynamically to multiple resource-constrained environments without sacrificing its downstream performance, which makes it suitable for on-device deployment on various edge devices. To the best of our knowledge, ours is the first work that investigates the information incompleteness of visually rich documents and proposes a generalizable, performant and computationally robust framework to address it in an end-to-end way.
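The matching step described in the abstract (score a text span against database tuples in a shared vector space, then return the closest tuples) can be illustrated with a minimal sketch. This is not Juno's model: the learned multimodal encoder with attention is replaced here by a toy bag-of-words vector, and the names `embed`, `serialize`, `match`, and the sample rows are all hypothetical. The one thing it does share with the description is that tuples are flattened to their values, so no attribute names or schema knowledge are needed.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in encoder: bag-of-words counts (an assumption, NOT Juno's learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def serialize(row):
    """Flatten a tuple to its values only, ignoring attribute names (schema-agnostic)."""
    return " ".join(str(v) for v in row.values())


def match(span, rows, top_k=1):
    """Return the top_k (score, row) pairs most similar to the text span."""
    span_vec = embed(span)
    scored = [(cosine(span_vec, embed(serialize(r))), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical external database and a text span taken from an event leaflet.
rows = [
    {"name": "Jazz Night", "venue": "Blue Hall", "city": "Columbus"},
    {"name": "Book Fair", "venue": "City Library", "city": "Dayton"},
]
best_score, best_row = match("Jazz night at Blue Hall", rows)[0]
print(best_row["name"], round(best_score, 2))
```

In a learned system, `embed` would be replaced by trained encoders for the span and the tuple, which is what would allow matches beyond shared keywords; the ranking by similarity in a common space stays the same.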

Read Extensively, Focus Smartly: A Cross-document Semantic Enhancement Method for Visual Documents NER

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Learning with Joint Cross-Document Information Via Multi-Task Learning for Named Entity Recognition

A Local Information Perception Enhancement–Based Method for Chinese NER

Joint Cross-document Information for Named Entity Recognition with Multi-task Learning

Cross-Modal Entity Matching for Visually Rich Documents

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

Exploiting global contextual information for document-level named entity recognition

Hypergraph based Understanding for Document Semantic Entity Recognition

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Visualizing Multi-Document Semantics Via Open Domain Information Extraction

CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention

EMGE: Entities and Mentions Gradual Enhancement with semantics and connection modeling for document-level relation extraction

Modeling Entities as Semantic Points for Visual Information Extraction in the Wild

Multimodal Pre-Training Based on Graph Attention Network for Document Understanding

CSMA-CNER: Multi-modal Chinese NER Task with Cross- and Self-Modality Attention

CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition

Focus Anywhere for Fine-grained Multi-page Document Understanding

Semantic-enhanced discriminative embedding learning for cross-modal retrieval

Semantic-enhanced graph neural network for named entity recognition in ancient Chinese books

Visual-Textual Sentiment Analysis Enhanced by Hierarchical Cross-Modality Interaction