FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo
2024-04-12
Abstract: Large Multi-modal Models (LMMs) have made impressive progress in many vision-language tasks. Nevertheless, the performance of general LMMs in specific domains is still far from satisfactory. This paper proposes FoodLMM, a versatile food assistant based on LMMs with various capabilities, including food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation and multi-round conversation. To enable FoodLMM to handle tasks beyond pure text output, we introduce a series of novel task-specific tokens and heads, which allow the model to predict food nutritional values and multiple segmentation masks. We adopt a two-stage training strategy. In the first stage, we utilize multiple public food benchmarks for multi-task learning by leveraging the instruction-following paradigm. In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain. Our fine-tuned FoodLMM achieves state-of-the-art results across several food benchmarks. We will make our code, models and datasets publicly available.
Computer Science
What problem does this paper attempt to address?
The paper addresses a gap in existing large multi-modal models (LMMs): although they perform well on general images and questions, they lack domain-specific expertise, so they often fail to provide reliable support in specific domains and may produce incorrect or hallucinated responses. In the food domain, for example, a general LMM asked about the nutritional content of a food image can usually name which nutrients are present but cannot give specific quantities.

To solve this problem, the paper proposes FoodLMM, a versatile food assistant built on an LMM and designed to handle food recognition, ingredient recognition, recipe generation, nutrition estimation, food segmentation, and multi-round conversation. A series of novel task-specific tokens and heads lets the model predict food nutritional values and multiple segmentation masks, outputs that go beyond pure text.

The paper adopts a two-stage training strategy. The first stage uses multiple public food benchmarks for multi-task learning under the instruction-following paradigm. The second stage fine-tunes the model on a newly constructed multi-round conversation dataset and a reasoning segmentation dataset, enabling it to conduct professional dialogues and to generate segmentation masks based on complex reasoning in the food domain. According to the paper, the fine-tuned FoodLMM achieves state-of-the-art results on several food benchmarks.
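
The idea of pairing task-specific tokens with heads can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes only the general pattern: when the model emits a special token, the hidden state at that position is passed to a small head that produces a number instead of text. The token names `<mass>` and `<cal>`, the function names, and the one-output linear heads are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# A head here is a one-output linear layer: (weights, bias).
Head = Tuple[Sequence[float], float]


def linear_head(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map one hidden-state vector to a scalar."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def read_task_tokens(
    tokens: List[str],
    hidden_states: List[Sequence[float]],
    heads: Dict[str, Head],
) -> Dict[str, float]:
    """Apply the matching head at every position that holds a task token.

    Ordinary text tokens are skipped; only tokens with a registered head
    (hypothetical names such as "<mass>") produce a numeric output.
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    outputs: Dict[str, float] = {}
    for token, hidden in zip(tokens, hidden_states):
        if token in heads:
            weights, bias = heads[token]
            outputs[token] = linear_head(hidden, weights, bias)
    return outputs


# Toy example with 2-dimensional hidden states (made-up numbers).
tokens = ["The", "dish", "weighs", "<mass>", "with", "<cal>"]
hidden_states = [[0.0, 0.0]] * 3 + [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
heads = {"<mass>": ([100.0, 50.0], 0.0), "<cal>": ([10.0, 5.0], 1.0)}
print(read_task_tokens(tokens, hidden_states, heads))
```

In a real model the hidden states would come from the language model's final layer and the heads would be trained jointly with it; a segmentation head would follow the same lookup pattern but decode a mask rather than a scalar.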