Dual-stream multi-label image classification model enhanced by feature reconstruction

Liming Hu, Mingxuan Chen, Anjie Wang, Zhijun Fang
DOI: https://doi.org/10.1007/s00530-024-01493-8
IF: 3.9
2024-09-22
Multimedia Systems
Abstract: Multi-label image classification (MLIC) is a highly practical and challenging task in computer vision. Compared with traditional single-label image classification, MLIC must model not only the dependencies between images and labels but also the spatial relationships within images and the dependencies among the labels themselves. In this paper, we propose the Dual-Stream Classification Network (DSCN) for multi-label image classification. In one branch, we capture more spatial information by segmenting the image. A feature reconstruction layer based on a self-attention mechanism recovers the boundary information lost after segmentation, while a transformer encoder captures the dependency between the image and the labels. The other branch enriches label semantics with multimodal features: templates extend the categories into prompts, which improves the reliability of the features, and the CLIP model provides multimodal association features between images and prompts. The final labels of an image are generated by a weighted fusion of the results from the two branches. We tested our model on three popular datasets: MSCOCO2014, VOC2007 and NUS-WIDE. DSCN outperformed state-of-the-art methods on these datasets, demonstrating the effectiveness of our approach.
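The two ideas in the abstract that are easiest to make concrete are the template-based prompts and the weighted fusion of the two branches. The following is a minimal sketch only, not the authors' implementation: the abstract gives neither the template text nor the fusion formula, so the template string, the sigmoid over per-label logits, the weight `alpha`, the decision threshold, and the function names `build_prompts` and `fuse_scores` are all assumptions made for illustration.

```python
import math


def build_prompts(categories, template="a photo of a {}"):
    """Extend each category name into a text prompt.

    The template string is a hypothetical example; the abstract does not
    say which templates are used.
    """
    return [template.format(name) for name in categories]


def sigmoid(x):
    """Map a logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(visual_logits, prompt_logits, alpha=0.5, threshold=0.5):
    """Fuse per-label scores from two branches with a convex weight.

    Assumed form: fused = alpha * p_visual + (1 - alpha) * p_prompt.
    Returns the fused probabilities and the thresholded label decisions.
    """
    if len(visual_logits) != len(prompt_logits):
        raise ValueError("both branches must score the same label set")
    fused = [
        alpha * sigmoid(v) + (1.0 - alpha) * sigmoid(p)
        for v, p in zip(visual_logits, prompt_logits)
    ]
    labels = [score >= threshold for score in fused]
    return fused, labels


if __name__ == "__main__":
    categories = ["person", "dog", "bicycle"]
    print(build_prompts(categories))
    # Made-up logits standing in for the outputs of the two branches.
    fused, labels = fuse_scores([2.0, -1.0, 0.3], [1.5, -2.0, -0.4], alpha=0.6)
    print(fused, labels)
```

A convex combination keeps the fused score in [0, 1], so a single threshold can be applied per label. In practice `alpha` would be tuned on validation data or learned.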
computer science, information systems, theory & methods