Abstract: This paper addresses the task of Multi-Modal Summarization with Multi-Modal Output for product descriptions on China's JD.COM e-commerce platform, where each description contains both source text and source images. When learning context from multi-modal (text and image) input, a semantic gap exists between text and image, especially in their cross-modal semantics. Capturing shared cross-modal semantics early is therefore crucial for multi-modal summarization. In addition, because the input text and images contribute differently to the output, summary generation should consider both the relevance and the irrelevance of the multi-modal contexts to the target summary. Doing so optimizes cross-modal context learning, guides the generation process, and emphasizes the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism captures the semantics shared between the modalities (text and image), raising the importance of crucial multi-modal information in the encoding stage. Additionally, the IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both text and image semantic information. Experimental results on the China JD.COM e-commerce dataset demonstrate that Multization effectively captures the shared semantics between the source text and source images and highlights the essential semantics. It also generates a multi-modal summary (text and image) that takes the semantic information of both modalities into account.

What problem does this paper attempt to address?

The paper primarily focuses on the task of Multi-Modal Summarization (MSM), particularly targeting multi-modal data (text and images) in product descriptions on China's JD.COM e-commerce platform. The core issue of the research is to bridge the semantic gap between the modalities (text and images) and to account for the different contributions of input text and images when generating multi-modal summaries. Specifically, the paper proposes a method called Multization, which aims to enhance multi-modal semantic information through multi-context relevance and irrelevance attention alignment. The Multization method includes the following key steps:

1. **Semantic Alignment Enhancement**: To capture shared cross-modal semantics early in the encoding stage, the paper constructs a semantic alignment enhancement mechanism for the multi-modal encoder. This includes:
   - First-level encoder: Extracts semantic representations of text and images.
   - Second-level gating mechanism: Selects the image most relevant to each text word.
   - Second-level multi-modal encoder: Fuses text words with their most relevant image information to obtain a multi-modal semantic alignment representation of the semantic information shared by the source text and images.
2. **Relevant and Irrelevant Multi-Context Learning**: To distinguish between relevant and irrelevant information when generating multi-modal summaries, the paper proposes a relevant and irrelevant multi-context learning mechanism. This mechanism observes the input source text and source images from relevant and irrelevant perspectives to compute relevant and irrelevant context vectors for text and images. These vectors provide comprehensive guidance for generating multi-modal summaries.
3. **Multi-Modal Decoding**: Based on the relevant and irrelevant context vectors, the paper constructs a multi-modal decoder to generate multi-modal summaries containing text and images. The decoder generates an initial vocabulary probability distribution from the relevant context vectors and then adjusts it using the irrelevant context vectors to produce the final vocabulary probability distribution. Additionally, the image most relevant to the current decoding hidden state is selected as the image summary.

Experimental results show that the proposed Multization method can effectively capture the shared semantics between the source text and source images and highlight important multi-modal information. The performance of this method on the JD.COM e-commerce product description dataset demonstrates its effectiveness in multi-modal summary generation.
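The gating, context, and decoding steps described above can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration, not the paper's actual formulation: relevance is scored with a plain dot product, the gate is a hard argmax, fusion is an element-wise sum, the irrelevant attention weights are taken as a softmax over negated scores, and the final distribution subtracts scaled "irrelevant" logits from the "relevant" ones. All function names and the `alpha` parameter are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def most_relevant_image(query, image_vecs):
    """Hard gate (assumed): index of the image with the highest dot-product score."""
    return max(range(len(image_vecs)), key=lambda j: dot(query, image_vecs[j]))

def fuse_words_with_images(word_vecs, image_vecs):
    """Assumed fusion: add each word vector to its most relevant image vector."""
    fused = []
    for w in word_vecs:
        img = image_vecs[most_relevant_image(w, image_vecs)]
        fused.append([a + b for a, b in zip(w, img)])
    return fused

def relevant_irrelevant_contexts(query, keys):
    """Relevant context weights keys by softmax(scores);
    irrelevant context weights them by softmax(-scores) (an assumption)."""
    scores = [dot(query, k) for k in keys]
    rel_w = softmax(scores)
    irr_w = softmax([-s for s in scores])
    dim = len(keys[0])
    rel = [sum(w * k[d] for w, k in zip(rel_w, keys)) for d in range(dim)]
    irr = [sum(w * k[d] for w, k in zip(irr_w, keys)) for d in range(dim)]
    return rel, irr

def adjusted_vocab_distribution(rel_logits, irr_logits, alpha=0.5):
    """Initial distribution comes from rel_logits; tokens favoured by the
    irrelevant context are pushed down by alpha * irr_logits (assumed rule)."""
    return softmax([r - alpha * i for r, i in zip(rel_logits, irr_logits)])

# Toy inputs: two text words and two images in a 2-dimensional space.
words = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8]]

picks = [most_relevant_image(w, images) for w in words]
fused = fuse_words_with_images(words, images)
rel_ctx, irr_ctx = relevant_irrelevant_contexts([1.0, 0.0], fused)

initial = softmax([2.0, 1.0, 0.0])
final = adjusted_vocab_distribution([2.0, 1.0, 0.0], [0.0, 0.0, 3.0])

# Image summary: the image closest to a (hypothetical) decoder hidden state.
image_summary = most_relevant_image([0.1, 0.9], images)
```

In this toy setup, each word is paired with the image it scores highest against, and the third vocabulary entry, which only the irrelevant logits favour, ends up with a lower probability in `final` than in `initial`. A trained model would learn the scoring, gating, and adjustment functions rather than use these fixed rules.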

Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment
