Multi-Modal Multi-Instance Multi-Label Learning with Graph Convolutional Network

Cheng Hang,Wei Wang,De-Chuan Zhan
DOI: https://doi.org/10.1109/ijcnn52387.2021.9534428
2021-01-01
Abstract:When applying machine learning to tackle realworld problems, it is common to see that objects come with multiple labels rather than a single label. In addition, complex objects can be composed of multiple modalities, e.g. a post on social media may contain both texts and images. Previous approaches typically treat every modality as a whole, while it is not the case in real world, as every post may contain multiple images and texts with quite diverse semantic meanings. Therefore, Multi-modal Multi-instance Multi-label (M3) learning was proposed. Previous attempt at M3 learning argues that exploiting label correlations is crucial. In this paper, we find that we can handle M3 problems using graph convolutional network. Specifically, a graph is built over all labels and each label is initially represented by its word embedding. The main goal for GCN is to map those label embed dings into inter-correlated label classifiers. Moreover, multi-instance aggregation is based on attention mechanism, making it more interpretable because it naturally learns to discover which pattern triggers the labels. Empirical studies are conducted on both benchmark datasets and industrial datasets, validating the effectiveness of our method, and it is demonstrated in ablation studies that the components in our methods are essential.
What problem does this paper attempt to address?