FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo

2024-04-12

Abstract: Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To enable FoodLMM to handle tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, enabling the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruction-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.

Computer Science

What problem does this paper attempt to address?

The paper addresses a limitation of existing large multi-modal models (LMMs): although they perform well on general images and questions, they lack domain-specific expertise, so in vertical domains they often fail to give reliable support and may produce incorrect or hallucinated responses. In the food domain specifically, a general LMM asked about the nutritional content of a food image can usually name the nutrients present but cannot give specific quantities. To address this, the paper proposes FoodLMM, a versatile food assistant based on LMMs that handles food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation, and multi-round conversation. A series of novel task-specific tokens and heads lets the model produce outputs beyond pure text, namely food nutritional values and multiple segmentation masks. Training proceeds in two stages. The first stage uses multiple public food benchmarks for multi-task learning under the instruction-following paradigm. The second stage fine-tunes the model on a newly constructed multi-round conversation dataset and a reasoning segmentation dataset, enabling it to conduct professional dialogues and to generate segmentation masks based on complex reasoning in the food domain. According to the paper, the fine-tuned FoodLMM achieves state-of-the-art results on several food benchmarks.
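The idea of task-specific tokens and heads can be illustrated with a minimal sketch. It is not the authors' implementation: the token name `<mass>`, the functions `linear_head` and `decode_nutrition`, and the use of a single linear layer per token are all assumptions made for illustration. The sketch only shows the general pattern in which the hidden state at a special token's position is passed to a dedicated head that outputs a number instead of text.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical head parameters: (weights, bias) for one linear layer per token.
Head = Tuple[Sequence[float], float]


def linear_head(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map one hidden-state vector to a scalar with a linear layer."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def decode_nutrition(
    tokens: List[str],
    hidden_states: List[Sequence[float]],
    heads: Dict[str, Head],
) -> Dict[str, float]:
    """Route the hidden state at each task-specific token to its regression head.

    Ordinary text tokens are skipped; only tokens that have a registered
    head (an assumption of this sketch) yield a numeric prediction.
    """
    predictions: Dict[str, float] = {}
    for token, hidden in zip(tokens, hidden_states):
        if token in heads:
            weights, bias = heads[token]
            predictions[token] = linear_head(hidden, weights, bias)
    return predictions


# Toy example with a 2-dimensional hidden state and a made-up "<mass>" token.
tokens = ["The", "dish", "weighs", "<mass>", "grams"]
hidden_states = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [0.0, 0.0]]
heads = {"<mass>": ([100.0, 50.0], 10.0)}

result = decode_nutrition(tokens, hidden_states, heads)
print(result)  # {'<mass>': 210.0}
```

In a real model the hidden states would come from the language model's final layer and the heads would be trained jointly with it; a segmentation token would feed a mask decoder rather than a scalar regressor.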

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

FoodSky: A Food-oriented Large Language Model that Passes the Chef and Dietetic Examination

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Large Language Models for Ingredient Substitution in Food Recipes using Supervised Fine-tuning and Direct Preference Optimization

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

Multi-Task Image-Based Dietary Assessment for Food Recognition and Portion Size Estimation

A Large-Scale Benchmark for Food Image Segmentation

Large Scale Visual Food Recognition

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

FMiFood: Multi-modal Contrastive Learning for Food Image Classification

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

RWMF: A Real-World Multimodal Foodlog Database

A Study of Multi-Task and Region-Wise Deep Learning for Food Ingredient Recognition