Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Huan Rong, Zhongfeng Chen, Zhenyu Lu, Fan Xu, Victor S. Sheng
DOI: https://doi.org/10.1145/3651983
IF: 1.471
2024-03-09
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: This paper focuses on the task of Multi-Modal Summarization with Multi-Modal Output for China JD.COM e-commerce product description containing both source text and source images. In the context learning of multi-modal (text and image) input, there exists a semantic gap between text and image, especially in the cross-modal semantics of text and image. As a result, capturing shared cross-modal semantics earlier becomes crucial for multi-modal summarization. On the other hand, when generating the multi-modal summarization, based on the different contributions of input text and images, the relevance and irrelevance of multi-modal contexts to the target summary should be considered, so as to optimize the process of learning cross-modal context to guide the summary generation process and to emphasize the significant semantics within each modality. To address the aforementioned challenges, Multization has been proposed to enhance multi-modal semantic information by multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism is employed to capture shared semantics between different modalities (text and image), so as to enhance the importance of crucial multi-modal information in the encoding stage. Additionally, the IR-Relevant Multi-Context Learning mechanism is utilized to observe the summary generation process from both relevant and irrelevant perspectives, so as to form a multi-modal context that incorporates both text and image semantic information. The experimental results in the China JD.COM e-commerce dataset demonstrate that the proposed Multization method effectively captures the shared semantics between the input source text and source images, and highlights essential semantics. It also successfully generates the multi-modal summary (including image and text) that comprehensively considers the semantics information of both text and image.
computer science, artificial intelligence
What problem does this paper attempt to address?
The paper addresses the task of Multi-Modal Summarization with Multi-Modal Output, targeting product descriptions on China's JD.COM e-commerce platform that contain both text and images. The core problem is to bridge the semantic gap between the two modalities (text and images) and to account for the different contributions of the input text and images when generating a multi-modal summary. To this end, the paper proposes a method called Multization, which enhances multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment.

The Multization method includes the following key steps:

1. **Semantic Alignment Enhancement**: To capture shared cross-modal semantics early, in the encoding stage, the paper constructs a semantic alignment enhancement mechanism for the multi-modal encoder. This includes:
   - First-level encoder: Extracts semantic representations of text and images.
   - Second-level gating mechanism: Selects the image most relevant to each text word.
   - Second-level multi-modal encoder: Fuses each text word with its most relevant image information to obtain a multi-modal semantic alignment representation that captures the semantics shared by the source text and images.
2. **Relevant and Irrelevant Multi-Context Learning**: To distinguish between relevant and irrelevant information when generating multi-modal summaries, the paper proposes a relevant and irrelevant multi-context learning mechanism. This mechanism views the input source text and source images from both relevant and irrelevant perspectives and computes relevant and irrelevant context vectors for text and images. These context vectors jointly guide the generation of the multi-modal summary.
3. **Multi-Modal Decoding**: Based on the relevant and irrelevant context vectors, the paper constructs a multi-modal decoder to generate multi-modal summaries containing text and images. The decoder generates an initial vocabulary probability distribution from the relevant context vectors and then adjusts it using the irrelevant context vectors to produce the final vocabulary probability distribution. In addition, the image most relevant to the current decoding hidden state is selected as the image summary.

According to the reported experiments on the JD.COM e-commerce product description dataset, Multization captures the shared semantics between the source text and source images, highlights important multi-modal information, and is effective at multi-modal summary generation.
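
To make the decoding idea above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the dot-product scoring, the use of a softmax over negated scores for the "irrelevant" attention, the subtraction-based adjustment with a `penalty` weight, and all function names are assumptions chosen for illustration only. The summary above states just that an initial vocabulary distribution comes from the relevant context, that it is adjusted with the irrelevant context, and that the image most relevant to the decoding hidden state is selected.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def most_relevant_image(query_vec, image_vecs):
    """Hard selection (assumed form): index of the image most similar to the query."""
    scores = [dot(query_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=lambda i: scores[i])


def relevant_irrelevant_contexts(state, memory):
    """Return (relevant, irrelevant) context vectors over the encoder memory.

    Assumption: relevant attention is softmax(scores), and irrelevant
    attention is softmax(-scores), so it favours low-scoring positions.
    """
    scores = [dot(state, m) for m in memory]
    relevant = weighted_sum(softmax(scores), memory)
    irrelevant = weighted_sum(softmax([-s for s in scores]), memory)
    return relevant, irrelevant


def final_vocab_distribution(relevant, irrelevant, vocab_vecs, penalty=1.0):
    """Initial logits from the relevant context, adjusted by the irrelevant one.

    Assumption: the adjustment subtracts `penalty` times the irrelevant logits.
    """
    initial_logits = [dot(relevant, v) for v in vocab_vecs]
    irrelevant_logits = [dot(irrelevant, v) for v in vocab_vecs]
    adjusted = [r - penalty * q for r, q in zip(initial_logits, irrelevant_logits)]
    return softmax(adjusted)


# Toy usage with made-up 2-dimensional vectors (illustrative values only).
state = [1.0, 0.0]                        # current decoding hidden state
text_memory = [[1.0, 0.0], [0.0, 1.0]]    # encoded source words
image_vecs = [[0.0, 1.0], [1.0, 0.0]]     # encoded source images
vocab_vecs = [[1.0, 0.0], [0.0, 1.0]]     # output embeddings of two words

rel, irr = relevant_irrelevant_contexts(state, text_memory)
probs = final_vocab_distribution(rel, irr, vocab_vecs)
image_summary = most_relevant_image(state, image_vecs)
```

In this toy setup the first vocabulary word aligns with the relevant context and is penalised least by the irrelevant one, so it receives the higher probability. The same selection helper could stand in for the word-level gating step, with a word vector as the query instead of the decoder state.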