Abstract: Cross-modal retrieval plays an important role in flexibly finding useful information across different modalities of data. Effectively measuring the similarity between different modalities is the key to cross-modal retrieval. Different modalities such as image and text have an imbalanced and complementary relationship: they contain unequal amounts of information when describing the same semantics. For example, images often contain details that textual descriptions cannot convey, and vice versa. Existing works based on deep neural networks (DNNs) mostly construct one common space for different modalities to find the latent alignments between them, which loses their exclusive modality-specific characteristics. Therefore, we propose a modality-specific cross-modal similarity measurement (MCSM) approach that constructs an independent semantic space for each modality and adopts an end-to-end framework to directly generate modality-specific cross-modal similarity without an explicit common representation. For each semantic space, the modality-specific characteristics within one modality are fully exploited by a recurrent attention network, while the data of the other modality is projected into this space with attention-based joint embedding, which uses the learned attention weights to guide fine-grained cross-modal correlation learning and captures the imbalanced and complementary relationship between different modalities. Finally, the complementarity between the semantic spaces of different modalities is explored by adaptive fusion of the modality-specific cross-modal similarities to perform cross-modal retrieval. Experiments on the widely used Wikipedia, Pascal Sentence, and MS-COCO datasets, as well as our constructed large-scale XMediaNet dataset, verify the effectiveness of the proposed approach, which outperforms 9 state-of-the-art methods.
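
The final fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the adaptive fusion is computed, so the sketch assumes a simple stand-in (min-max normalization of each similarity matrix followed by a weighted sum with a fixed weight `alpha`). The function names, the `alpha` parameter, and the toy scores are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def min_max_normalize(sim: Matrix) -> Matrix:
    """Rescale every score in a similarity matrix to the range [0, 1]."""
    flat = [s for row in sim for s in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # All scores are identical, so no candidate is preferred.
        return [[0.0 for _ in row] for row in sim]
    return [[(s - lo) / span for s in row] for row in sim]


def fuse_similarities(sim_image_space: Matrix,
                      sim_text_space: Matrix,
                      alpha: float = 0.5) -> Matrix:
    """Combine two modality-specific similarity matrices.

    ASSUMPTION: a fixed-weight sum stands in for the paper's adaptive
    fusion, whose exact form is not given in the abstract.
    """
    a = min_max_normalize(sim_image_space)
    b = min_max_normalize(sim_text_space)
    return [[alpha * x + (1.0 - alpha) * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def rank_candidates(fused: Matrix, query_index: int) -> List[int]:
    """Return candidate indices for one query, most similar first."""
    row = fused[query_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)


# Toy scores (hypothetical): 2 queries x 3 candidates, one matrix per
# semantic space. The two spaces use different score scales on purpose.
sim_img = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]]
sim_txt = [[2.0, 1.0, 3.0], [0.5, 2.5, 1.5]]

fused = fuse_similarities(sim_img, sim_txt, alpha=0.5)
ranking = rank_candidates(fused, query_index=0)
```

Normalizing before the sum keeps one space from dominating only because its scores are on a larger scale. A learned or per-query weight could replace `alpha` to make the fusion adaptive.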

CSMA-CNER: Multi-modal Chinese NER Task with Cross- and Self-Modality Attention

Pretraining Multi-modal Representations for Chinese NER Task with Cross-Modality Attention

A Local Information Perception Enhancement–Based Method for Chinese NER

MULTIMODAL CROSS- AND SELF-ATTENTION NETWORK FOR SPEECH EMOTION RECOGNITION

Multi-Modality Cross Attention Network for Image and Sentence Matching

Pronounce Differently, Mean Differently: A Multi-Tagging-scheme Learning Method for Chinese NER Integrated with Lexicon and Phonetic Features

Fast Neural Chinese Named Entity Recognition with Multi-head Self-attention

A Double Adversarial Network Model for Multi-Domain and Multi-Task Chinese Named Entity Recognition

CAN-NER: Convolutional Attention Network for Chinese Named Entity Recognition

CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention

Enhanced Chinese Domain Named Entity Recognition: An Approach with Lexicon Boundary and Frequency Weight Features

MSFM: Multi-view Semantic Feature Fusion Model for Chinese Named Entity Recognition

Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

Multi-Granularity Cross-Modality Representation Learning for Named Entity Recognition on Social Media

CMNER: A Chinese Multimodal NER Dataset based on Social Media

Cross-modal Enhancement Network for Multimodal Sentiment Analysis

Cross‐modal retrieval with dual multi‐angle self‐attention

TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis

Chinese Named-Entity Recognition Via Self-Attention Mechanism and Position-Aware Influence Propagation Embedding

mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning

MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition