OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation

Xiongwei Wu, Sicheng Yu, Ee-Peng Lim, Chong-Wah Ngo

2024-04-02

Abstract: In the realm of food computing, segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients, the emergence of new ingredients, and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily utilize a closed-vocabulary setting with static text embeddings. These methods often fall short in handling ingredients effectively, particularly new and diverse ones. In response to these limitations, we introduce OVFoodSeg, a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs), our approach enriches text embeddings with image-specific information through two innovative modules, namely an image-to-text learner, FoodLearner, and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset, setting a new milestone for food image segmentation.

Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia

What problem does this paper attempt to address?

The paper proposes OVFoodSeg, a framework for open-vocabulary food image segmentation. Existing methods mainly operate in a closed-vocabulary setting with static text embeddings, which handle new and diverse food ingredients poorly. OVFoodSeg instead adopts an open-vocabulary setting and enriches text embeddings with visual context through two modules, FoodLearner and the Image-Informed Text Encoder. Training proceeds in two stages, pre-training and segmentation learning, with the aim of letting the model adapt to ingredients unseen in the training data while maintaining high accuracy. In the pre-training stage, FoodLearner learns to align visual information with food-related text representations. In the segmentation learning stage, FoodLearner and the Image-Informed Text Encoder are adapted to the segmentation task. In this way, OVFoodSeg targets the three difficulties named in the abstract: large intra-class variance among ingredients, the emergence of new ingredients, and high annotation cost. On the FoodSeg103 dataset, OVFoodSeg achieves a 4.9% improvement in mean Intersection over Union (mIoU) compared to state-of-the-art methods like SAN, setting a new standard for food image segmentation.
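For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean Intersection over Union can be computed for a single predicted label map. The function name `mean_iou` and the toy 3x3 label maps are illustrative assumptions and are not taken from the paper; benchmark toolkits typically accumulate the per-class counts over an entire dataset rather than one image, and may treat absent classes differently.

```python
from typing import Sequence


def mean_iou(pred: Sequence[Sequence[int]],
             target: Sequence[Sequence[int]],
             num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target."""
    intersection = [0] * num_classes  # pixels where pred and target agree on class c
    union = [0] * num_classes         # pixels labelled c in pred or in target
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                intersection[p] += 1
                union[p] += 1
            else:
                # A mismatched pixel enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    # Skip classes that appear in neither map (assumption: they are ignored).
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical toy example: 3 ingredient classes on a 3x3 image.
pred = [[0, 0, 1],
        [0, 1, 1],
        [2, 2, 1]]
target = [[0, 0, 1],
          [0, 0, 1],
          [2, 2, 2]]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

Here the per-class IoUs are 3/4, 2/4, and 2/3, so the mean is roughly 0.639. A gain in mIoU therefore reflects better overlap averaged across ingredient classes, not just higher overall pixel accuracy.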

A Large-Scale Benchmark for Food Image Segmentation

EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

Real-time and accurate model of instance segmentation of foods

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Open-Vocabulary Semantic Segmentation with Image Embedding Balancing

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

FoodSAM: Any Food Segmentation

mid-DeepLabv3+: A Novel Approach for Image Semantic Segmentation Applied to African Food Dietary Assessments

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

Large Scale Visual Food Recognition

Towards Open-Vocabulary Video Semantic Segmentation

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Chinese Dish Segmentation Based on Local Variation Driven Superpixel Grouping and Region Analysis

SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

FoodMem: Near Real-time and Precise Food Video Segmentation

Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Superpixel-Based Image Recognition For Food Images

A Study of Multi-Task and Region-Wise Deep Learning for Food Ingredient Recognition