Abstract: Multimedia data of different modalities, such as image and text, are abundant but have inconsistent distributions and representations. Much work has been done to bridge image and text and measure their correlation. However, existing methods focus either on transformation into a common subspace or on unidirectional generation from one modality to the other, and so cannot fully explore the interactions between them. Bidirectional generation between image and text provides complementary hints that mutually boost cross-modal correlation learning, and cross-modal correlation learning in turn feeds back comprehensive clues that promote cross-modal generation. We are therefore motivated to treat information transmission between image and text as a circular process, which aims to fully capture their latent correlation and to realize cross-modal generation of both realistic images and text descriptions in a unified framework. In this paper, we propose the cross-modal circular correlation learning approach, which performs cross-modal correlation learning and generation simultaneously through an efficient circular training procedure. First, we propose a cross-modal circular learning model that performs image-to-text captioning and text-to-image synthesis circularly and learns a common representation as a round-trip bridge, enabling efficient interactions that fully exploit latent cross-modal correlations. Second, we propose a unified bidirectional framework for cross-modal mutual generation, trained in an efficient circular process to enhance the generative ability of the common representation, which in turn feeds back to further promote cross-modal correlation learning.
In summary, we simultaneously perform cross-modal retrieval, image-to-text captioning, and text-to-image synthesis in a unified framework with the circular learning process, which offers high scalability and generality toward universal cognition of cross-modal data. We conduct extensive experiments on the MS-COCO dataset, evaluating correlation performance through cross-modal retrieval and demonstrating generation effectiveness for both image captioning and image synthesis.
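
The circular idea described in the abstract can be illustrated with a toy objective. The sketch below is not the paper's model: it assumes plain linear maps in place of the captioning and synthesis networks, squared-error losses, and hypothetical names (`circular_loss`, `enc_img`, `enc_txt`, `dec_img`, `dec_txt`, `lam`) chosen only for illustration. It shows how a correlation term in the common space, a generation term in each direction, and a round-trip term might be combined into one training signal.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def circular_loss(img, txt, enc_img, enc_txt, dec_img, dec_txt, lam=1.0):
    """Toy circular objective for one paired (image, text) sample.

    All four maps are hypothetical linear stand-ins for learned networks.
    """
    # Encode both modalities into the shared (common) representation.
    z_img = matvec(enc_img, img)
    z_txt = matvec(enc_txt, txt)

    # Correlation term: paired samples should land close together.
    correlation = sq_dist(z_img, z_txt)

    # Generation terms: image -> text ("caption") and text -> image ("synthesis").
    txt_hat = matvec(dec_txt, z_img)
    img_hat = matvec(dec_img, z_txt)
    generation = sq_dist(txt_hat, txt) + sq_dist(img_hat, img)

    # Round-trip terms: the generated sample is re-encoded and decoded
    # back into the modality the loop started from.
    img_cycle = matvec(dec_img, matvec(enc_txt, txt_hat))
    txt_cycle = matvec(dec_txt, matvec(enc_img, img_hat))
    cycle = sq_dist(img_cycle, img) + sq_dist(txt_cycle, txt)

    return correlation + lam * (generation + cycle)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    loss = circular_loss([1.0, 0.0], [0.0, 1.0],
                         identity, identity, identity, identity)
    print(loss)
```

In this toy setting the loss is zero only when the paired image and text map to the same common representation and each direction reconstructs the other modality exactly, which is one simple way to read the "round-trip bridge" role of the common representation.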

Cross-modal Bidirectional Translation Via Reinforcement Learning

Reinforced Cross-Media Correlation Learning by Context-Aware Bidirectional Translation

Cross-Lingual Text Image Recognition Via Multi-Task Sequence to Sequence Learning

Unpaired Multimodal Neural Machine Translation via Reinforcement Learning

Image Cross-Domain Translation Algorithm Based on Self-Similarity and Contrastive Learning

Multimodal Image-to-Image Translation via Mutual Information Estimation and Maximization

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

Show and Tell in the Loop: Cross-Modal Circular Correlation Learning

Dual-View Curricular Optimal Transport for Cross-Lingual Cross-Modal Retrieval

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Bilingual–Visual Consistency for Multimodal Neural Machine Translation

Improving Cross-Modal Image-Text Retrieval With Teacher-Student Learning

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation

Image-to-Image Translation with Multi-Path Consistency Regularization

Enhancing Image Description Generation through Deep Reinforcement Learning: Fusing Multiple Visual Features and Reward Mechanisms

Improving Braille-Chinese translation with jointly trained and pre-trained language models

Visual Agreement Regularized Training for Multi-Modal Machine Translation

Learning Inter-Related Statistical Query Translation Models for English-Chinese Bi-Directional CLIR

Good for Misconceived Reasons: An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation