Social Event Classification Based on Multimodal Masked Transformer Network

Chen Hong, Qian Shengsheng, Li Zhangming, Fang Quan, Xu Changsheng
DOI: https://doi.org/10.59782/sidr.v2i1.122
2024-01-01
Abstract: The key to multimodal social event classification is to make full and accurate use of the features of both the image and text modalities. However, most existing methods have two limitations: (1) they simply concatenate the image and text features of an event, and (2) irrelevant contextual information across modalities causes mutual interference. It is therefore not enough to model only the relationships between modalities; the irrelevant context (i.e., image regions or words) between modalities must also be taken into account. To overcome these limitations, a novel social event classification method based on a multimodal masked transformer network (MMTN) is proposed. First, an image-text encoding network learns better representations of the text and the image. The resulting representations are then fed into the multimodal masked transformer network, which fuses the multimodal information: it models the relationships between modalities by computing the similarity between them and masks the irrelevant context between modalities. Extensive experiments on two benchmark datasets show that the proposed MMTN model achieves state-of-the-art performance.
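The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the cosine similarity, the fixed `threshold`, the rule of always keeping each word's best-matching region, and the name `masked_cross_attention` are all assumptions made for illustration. The sketch uses plain Python lists so that it runs with the standard library alone.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def masked_cross_attention(words, regions, threshold=0.0):
    """Attend from each word vector to the image regions it is similar to.

    Hypothetical illustration: regions whose cosine similarity to a word is
    not above `threshold` are masked out (weight 0) before the softmax.
    Returns (attended, weights), with one row per word in each.
    """
    dim = len(regions[0])
    attended, weights = [], []
    for w in words:
        sims = [cosine(w, r) for r in regions]
        best = max(range(len(regions)), key=lambda j: sims[j])
        # Assumed rule: keep the best region so no row is fully masked.
        keep = [s > threshold or j == best for j, s in enumerate(sims)]
        # Scaled dot-product logits, as in standard transformer attention.
        logits = [sum(a * b for a, b in zip(w, r)) / math.sqrt(dim)
                  for r in regions]
        # Softmax over the kept regions only; masked regions get weight 0.
        peak = max(l for l, k in zip(logits, keep) if k)
        exps = [math.exp(l - peak) if k else 0.0 for l, k in zip(logits, keep)]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        attended.append([sum(p * r[i] for p, r in zip(row, regions))
                         for i in range(dim)])
    return attended, weights

# Toy input: one word vector and three region vectors (made-up numbers).
words = [[1.0, 0.0]]
regions = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
attended, weights = masked_cross_attention(words, regions)
print(weights[0])   # [0.5, 0.5, 0.0]
print(attended[0])  # [1.0, 0.5]
```

In the toy input, the third region points away from the word, so it is masked and receives zero weight, while the two remaining regions share the attention equally. A real model would use learned projections, multiple heads, and attention in both directions (text to image and image to text).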