CTNR: Compress-then-Reconstruct Approach for Multimodal Abstractive Summarization

Chenxi Zhang, Zijian Zhang, Jiangfeng Li, Qin Liu, Hongming Zhu
DOI: https://doi.org/10.1109/IJCNN52387.2021.9534082
2021-01-01
Abstract: With the rapid growth of multimodal data on social media and the strong demand for concise yet informative content, multimodal summarization has drawn much attention in both industry and academia. It typically produces a textual summary from multiple sources using computer vision or natural language processing techniques. However, two challenges remain in modeling this task: 1) feature representation is limited by the lack of alignment among multimodal data; 2) training requires massive parallel data, which is time-consuming and laborious to obtain. In this paper, we introduce an unsupervised architecture (Compress-then-Reconstruct, CTNR) that generates the summary in an end-to-end manner, and a Cross-Modal Transformer module (CMTrans) that fuses non-aligned multimodal information. Comprehensive experiments show that the proposed CTNR framework with CMTrans outperforms mainstream unsupervised approaches in terms of BLEU, ROUGE, and relevance scores on the MSMO and YouTube News datasets, with average improvements of 8.82% and 11.01%, respectively.