Abstract: Fine-grained image datasets have small inter-class differences and large intra-class differences, which makes fine-grained image classification difficult. Traditional fine-grained image classification methods focus only on the visual features of images, a limitation that can be overcome by incorporating multimodal information. This paper proposes an improved fine-grained image classification method with multimodal information, comprising multimodal data preprocessing, multimodal feature extraction, multi-temporal feature fusion, and decision correction. The proposed preprocessing method addresses the scattered distribution, difficult processing, and uneven contribution to prediction of multimodal data through normalization, phrase packing, and weighted concatenation. For multimodal feature extraction, the proposed SAMLP (Self-Attention MLP) module combines self-attention with an MLP to capture the internal correlations of multimodal information. The proposed multi-temporal feature fusion is divided into early and late feature fusion: the former adds multimodal information markers to the original image, and the latter uses a multi-cascade dynamic MLP structure to fuse visual and multimodal features. To address the limitations of feature fusion, a decision strategy is proposed that revises the predictions made from the fused features according to the predictions made from the multimodal features. Ablation experiments on the INAT18-1K and INAT21-1K datasets show that our method effectively improves classification with multimodal information. Experiments on the large INAT2021_mini dataset show that the complete method achieves higher accuracy than the state-of-the-art method with negligible loss of efficiency.
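The abstract names normalization, weighted concatenation, and decision correction without giving their exact form. The following is a minimal sketch of how such steps might look, not the paper's implementation. The metadata fields (latitude, longitude, day of year), their ranges, the per-field weights, the confidence threshold, and the function names `normalize`, `preprocess`, and `correct_decision` are all illustrative assumptions.

```python
def normalize(value, low, high):
    """Linearly map value from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def preprocess(lat, lon, day_of_year, weights=(1.0, 1.0, 0.5)):
    """Normalize each metadata field, then concatenate with per-field weights.

    Assumption: the fields, ranges, and weights are hypothetical; the
    abstract only states that normalization and weighted concatenation
    are used.
    """
    normalized = (
        normalize(lat, -90.0, 90.0),
        normalize(lon, -180.0, 180.0),
        normalize(day_of_year, 1.0, 365.0),
    )
    # Scale each normalized field by its weight so that fields contribute
    # unevenly to the concatenated metadata vector.
    return [w * x for w, x in zip(weights, normalized)]


def correct_decision(fused_probs, meta_probs, threshold=0.9):
    """Revise the fused prediction using the metadata-only prediction.

    Assumption: a hypothetical rule that keeps the class predicted from the
    fused features unless the metadata-only classifier disagrees with a
    confidence of at least `threshold`.
    """
    fused_cls = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    meta_cls = max(range(len(meta_probs)), key=meta_probs.__getitem__)
    if meta_cls != fused_cls and meta_probs[meta_cls] >= threshold:
        return meta_cls
    return fused_cls
```

For example, `correct_decision([0.6, 0.4], [0.05, 0.95])` overrides the fused prediction because the metadata-only classifier is highly confident in the other class, while `correct_decision([0.6, 0.4], [0.4, 0.6])` keeps it. A threshold rule is only one possible design; the paper's actual strategy may differ.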

Multi-modal Learning for Social Image Classification

Social Image-text Sentiment Classification With Cross-Modal Consistency and Knowledge Distillation

Social Image Sentiment Analysis by Exploiting Multimodal Content and Heterogeneous Relations

Multimodal Classification for Analysing Social Media

Cross-Modal Image-Tag Relevance Learning for Social Images

Multimodal Learning of Social Image Representation by Exploiting Social Relations

Multi-modal microblog classification via multi-task learning

A multimodal sentiment recognition method based on attention mechanism

Improving Fine-grained Image Classification with Multimodal Information

TCGM: an Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning

CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection

Learning Visual Emotion Distributions via Multi-Modal Features Fusion

Multi-Modal Multi-Label Semantic Indexing Of Images Based On Hybrid Ensemble Learning

Multimodal Sentiment Analysis Using Multi-tensor Fusion Network with Cross-modal Modeling

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Learn to Combine Modalities in Multimodal Deep Learning

Collaboration based multi-modal multi-label learning

Learning and Fusing Multimodal Features from and for Multi-task Facial Computing

Semisupervised image classification by mutual learning of multiple self‐supervised models

Multi-Modal Curriculum Learning for Semi-Supervised Image Classification

MFSC: A Multimodal Aspect-Level Sentiment Classification Framework with Multi-Image Gate and Fusion Networks