Abstract: Fine-grained image datasets exhibit small inter-class differences and large intra-class differences, which makes fine-grained image classification difficult. Traditional fine-grained image classification methods focus only on the visual features of images; this limitation can be overcome by augmenting them with multimodal information. This paper proposes an improved fine-grained image classification method with multimodal information, comprising multimodal data preprocessing, multimodal feature extraction, multi-temporal feature fusion, and decision correction. The proposed preprocessing addresses three problems of multimodal data (scattered distribution, difficult processing, and uneven contribution to prediction) through normalization, phrase packing, and weighted concatenation. For multimodal feature extraction, the proposed SAMLP (Self-Attention MLP) module combines self-attention with an MLP to capture the internal correlations of multimodal information. The proposed multi-temporal feature fusion consists of early and late feature fusion: the former adds multimodal information markers to the original image, and the latter uses a multi-cascade dynamic MLP structure to fuse visual and multimodal features. To address the limitations of feature fusion, a decision strategy is proposed that revises the predictions made from the fused features according to the predictions made from the multimodal features. Ablation experiments on the INAT18-1K and INAT21-1K datasets show that the method effectively improves classification with multimodal information. Experiments on the large INAT2021_mini dataset show that the complete method achieves higher accuracy than the state-of-the-art method with negligible loss of efficiency.
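The abstract names normalization, weighted concatenation, and decision correction but does not give their formulas. The following is a minimal sketch of what such steps might look like, not the authors' implementation. All function names, the `[-1, 1]` target range, the per-modality weights, and the correction rule (reweighting fused-feature probabilities by multimodal-only probabilities raised to a hypothetical exponent `alpha`) are assumptions made for illustration.

```python
import math


def min_max_normalize(values, low, high):
    """Scale raw values from the range [low, high] to [-1, 1] (assumed target range)."""
    return [2.0 * (v - low) / (high - low) - 1.0 for v in values]


def weighted_concat(modalities, weights):
    """Scale each modality vector by its weight, then concatenate the results.

    The weights are hypothetical; they stand in for balancing the uneven
    contribution of each modality to the prediction.
    """
    if len(modalities) != len(weights):
        raise ValueError("need exactly one weight per modality")
    fused = []
    for vector, weight in zip(modalities, weights):
        fused.extend(weight * x for x in vector)
    return fused


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def correct_decision(fused_logits, multimodal_logits, alpha=0.5):
    """Revise fused-feature predictions using multimodal-only predictions.

    Assumed rule (not taken from the paper):
        p_i is proportional to p_fused_i * p_multimodal_i ** alpha
    With alpha = 0 the fused prediction is returned unchanged.
    """
    p_fused = softmax(fused_logits)
    p_multi = softmax(multimodal_logits)
    scores = [pf * pm ** alpha for pf, pm in zip(p_fused, p_multi)]
    total = sum(scores)
    return [s / total for s in scores]
```

Under these assumptions, a call such as `correct_decision(fused_logits, multimodal_logits)` returns a probability vector in which classes that the multimodal branch considers unlikely are down-weighted, so a narrow visual decision between two similar classes can be overturned by strong multimodal evidence.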

Improving Fine-grained Image Classification with Multimodal Information
