Abstract: Named Entity Recognition (NER) on social media refers to discovering and classifying entities in unstructured free-form content, and it plays an important role in applications such as intention understanding and user recommendation. As social media posts are increasingly multimodal, Multimodal Named Entity Recognition (MNER) for text with its accompanying image is attracting growing attention, since some textual components can only be understood in combination with visual information. However, existing approaches have two drawbacks: 1) The meanings of the text and its accompanying image do not always match, so the text still plays the major role. Yet social media posts are usually shorter and more informal than other content, which easily leads to incomplete semantic descriptions and data sparsity. 2) Although visual representations of whole images or objects are already used, existing methods ignore either the fine-grained semantic correspondence between objects in images and words in text, or the fact that some images contain misleading objects or no objects at all. In this work, we address these two problems by introducing multi-granularity cross-modality representation learning. To resolve the first problem, we enhance the representation of each word in the text through semantic augmentation. For the second, we perform cross-modality semantic interaction between text and vision at different visual granularities to obtain the most effective multimodal guidance representation for every word. Experiments show that our approach achieves state-of-the-art (SOTA) or near-SOTA performance on two benchmark datasets of tweets. The code, data and the best performing models are available at https://github.com/LiuPeiP-CS/IIE4MNER
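To make the idea of word-level multimodal guidance at two visual granularities more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that words, detected objects, and the whole image are already embedded as vectors of the same dimension. The scaled dot-product attention, the sigmoid gates, the additive fusion, and all function names (`object_context`, `guided_word`) are illustrative assumptions that do not come from the abstract. The sketch only shows how a word could attend over objects (fine granularity), fall back gracefully when an image has no objects, and down-weight visual input that disagrees with the word.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def object_context(word, objects):
    """Fine granularity (assumed): one word attends over object vectors.

    Returns a zero vector when the image has no detected objects.
    """
    if not objects:
        return [0.0] * len(word)
    scale = math.sqrt(len(word))
    weights = softmax([dot(word, obj) / scale for obj in objects])
    return [
        sum(w * obj[i] for w, obj in zip(weights, objects))
        for i in range(len(word))
    ]


def guided_word(word, objects, image):
    """Fuse a word with object-level and image-level guidance (assumed form).

    Each gate lies in (0, 1) and shrinks when the visual vector disagrees
    with the word, a simple stand-in for filtering misleading visual content.
    """
    scale = math.sqrt(len(word))
    obj_ctx = object_context(word, objects)
    gate_obj = sigmoid(dot(word, obj_ctx) / scale)
    gate_img = sigmoid(dot(word, image) / scale)
    return [
        w + gate_obj * o + gate_img * g
        for w, o, g in zip(word, obj_ctx, image)
    ]


# Hypothetical toy inputs: two 2-d word vectors, two objects, one image vector.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[2.0, 0.0], [0.0, 2.0]]
image = [0.5, 0.5]
guided = [guided_word(w, objects, image) for w in words]
```

In this sketch an image without objects contributes nothing at the object level, because the attended context is the zero vector, so the word representation is then shaped only by the text and the whole-image vector.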
