FMiFood: Multi-modal Contrastive Learning for Food Image Classification

Xinyue Pan, Jiangpeng He, Fengqing Zhu

2024-08-08

Abstract: Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge of food images is the intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings to focus on multiple key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses food image classification and proposes a new multimodal contrastive learning framework to improve its accuracy. Food image classification is a fundamental step in image-based dietary assessment, whose goal is to estimate participants' nutrient intake from images of eating occasions. Existing methods struggle with intra-class diversity (the same food can look very different) and inter-class similarity (different foods can look alike), both of which significantly hinder classification performance.

To address these issues, the authors propose FMiFood (Flexible Matching for image Classification on Food images). The framework improves classification accuracy by integrating additional contextual information, such as textual descriptions of food categories. Specifically, FMiFood makes the following contributions:

1. **Flexible Matching Technique**: An image patch may match multiple text labels, or none at all, which better captures the complexity and fine-grained details of food images.
2. **Incorporation of Classification Objectives**: In addition to the contrastive learning objective, the framework adds a separate image classification branch trained with soft cross-entropy and hard cross-entropy losses, so that the model is optimized directly for the classification task.
3. **Enrichment of Text Descriptions**: Detailed descriptions generated by GPT-4 enrich the text for each food category, helping the model capture subtle differences between food images and thereby improving classification performance.
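
As a rough illustration of the first two contributions, the Python sketch below shows one way a patch could match several text tokens or none, together with a cross-entropy that accepts either hard (one-hot) or soft targets. It is not the authors' implementation. The summary only states that patches may match multiple text labels or none and that soft and hard cross-entropy losses are used; the similarity threshold, the choice to skip unmatched patches, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flexible_match_score(patches, tokens, threshold=0.5):
    """Image-to-text similarity with threshold-based matching (assumed rule).

    Each patch is matched to every text token whose cosine similarity is at
    least `threshold`, so it can match several tokens or none. Patches with
    no match are skipped; the score is the mean over matched patches.
    """
    patch_scores = []
    for patch in patches:
        sims = [cosine(patch, token) for token in tokens]
        matched = [s for s in sims if s >= threshold]
        if matched:
            patch_scores.append(sum(matched) / len(matched))
    if not patch_scores:
        return 0.0
    return sum(patch_scores) / len(patch_scores)


def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution.

    A one-hot target gives the usual hard cross-entropy; any other
    probability vector gives a soft cross-entropy.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target_probs, logits))


# Toy 2-D embeddings (hypothetical): three image patches, two text tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[1.0, 0.0], [0.6, 0.8]]

# Patch 1 matches both tokens, patch 2 matches one, patch 3 matches none.
print(round(flexible_match_score(patches, tokens), 3))  # approximately 0.8
print(cross_entropy([2.0, 0.0], [1.0, 0.0]))  # hard target
print(cross_entropy([2.0, 0.0], [0.7, 0.3]))  # soft target
```

In this sketch the third patch plays the role of a background region: it is dissimilar to every token, so it is left unmatched and does not dilute the image-to-text score.
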
The experimental section reports the average classification accuracy of FMiFood on the UPMC-Food101 and VFN datasets and compares it with a series of baseline models; the results support the effectiveness of the proposed method. The paper also presents ablation studies that analyze the contribution of each component to overall performance, and it discusses potential directions for future research.
