FMiFood: Multi-modal Contrastive Learning for Food Image Classification

Xinyue Pan, Jiangpeng He, Fengqing Zhu
2024-08-08
Abstract: Food image classification is the fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge of food images is the intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings to focus on multiple key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets food image classification, the fundamental step in image-based dietary assessment, where the goal is to estimate participants' nutrient intake from images of eating occasions. Existing methods struggle with intra-class diversity and inter-class similarity, both of which significantly hinder classification performance. To address this, the authors propose a novel multi-modal contrastive learning framework named FMiFood (Flexible Matching for image Classification on Food images), which improves classification accuracy by integrating additional contextual information, such as text descriptions of food categories. FMiFood makes three contributions:
1. **Flexible Matching Technique**: An image patch may match several text labels or none at all, which better captures the complexity and fine-grained detail of food images.
2. **Incorporation of Classification Objectives**: Alongside the contrastive learning objective, the framework adds a separate image classification branch trained with both a soft cross-entropy loss and a hard cross-entropy loss, so that the model is optimized directly for the classification task.
3. **Enrichment of Text Descriptions**: Detailed descriptions generated by GPT-4 enrich the text available for each food category, helping the model pick up subtle differences between food images and thereby improving classification performance.
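The first two contributions can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary does not give the exact matching rule or how the soft targets are built, so the cosine-similarity threshold, the averaging scheme, and every function and parameter name below (`flexible_similarity`, `threshold`, `soft_cross_entropy`, and so on) are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def flexible_similarity(patches, tokens, threshold=0.5):
    """Image-text score where a patch may match many tokens or none.

    patches: (P, D) image patch embeddings
    tokens:  (T, D) text token embeddings
    ASSUMPTION: a patch "matches" a token when their cosine similarity
    exceeds `threshold`; the paper's actual rule may differ.
    """
    sim = l2_normalize(patches) @ l2_normalize(tokens).T  # (P, T) cosine
    mask = sim > threshold            # True where a patch matches a token
    matched = mask.any(axis=1)        # patches with at least one match
    if not matched.any():
        return 0.0                    # no patch matched any token
    # Average the matching similarities per patch, then over matched patches.
    per_patch = (sim * mask).sum(axis=1)[matched] / mask.sum(axis=1)[matched]
    return float(per_patch.mean())


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def hard_cross_entropy(logits, labels):
    """Cross-entropy against integer class labels."""
    log_probs = log_softmax(logits)
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a probability distribution per sample.

    ASSUMPTION: `target_probs` is any (N, C) array of class probabilities;
    how the paper derives its soft targets is not stated in this summary.
    """
    log_probs = log_softmax(logits)
    return float(-(target_probs * log_probs).sum(axis=-1).mean())
```

With one-hot targets the soft loss reduces to the hard loss, which is a quick sanity check on both functions. In a full training loop the two classification losses would presumably be weighted and added to the contrastive loss, but the weighting is not specified here.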
The experimental section reports the average classification accuracy of FMiFood on the UPMC-Food101 and VFN datasets and compares it against a series of baseline models, supporting the effectiveness of the proposed method. The paper also includes ablation studies that analyze how each component affects overall performance, and it discusses potential directions for future research.