Learnable Hierarchical Label Embedding and Grouping for Visual Intention Understanding

QingHongYa Shi, Mang Ye, Ziyi Zhang, Bo Du
DOI: https://doi.org/10.1109/taffc.2023.3247876
IF: 13.99
2023-01-01
IEEE Transactions on Affective Computing
Abstract: Visual intention understanding aims to mine the latent, subjective intention behind images, including the user's hidden emotions and perspectives. To address the label ambiguity inherent in this task, this paper presents a novel learnable Hierarchical Label Embedding and Grouping (HLEG) method. It has three key features: 1) To effectively mine the underlying meaning of images, we build a hierarchical transformer structure that models the hierarchy of labels, formulating a multi-level classification scheme. 2) To handle label ambiguity, we design a novel learnable label embedding with accumulative grouping, integrated into the hierarchical structure, which requires no additional annotation. 3) For multi-level classification, we propose a “Hard-First” optimization strategy that adaptively adjusts the classification optimization at different levels, avoiding over-classification of the coarse labels. HLEG improves the F1 score (by 1.24% on average) and mAP (by 1.48% on average) on Intentonomy over prominent baseline models. Comprehensive experiments validate the superiority of the proposed method, which achieves state-of-the-art performance under various settings. Code is available at https://github.com/ShiQingHongYa/HLEG.