FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

Yuki Imajuku, Yoko Yamakata, Kiyoharu Aizawa

2024-09-27

Abstract: Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle language naturally. Although they are used predominantly in English, they also support other languages, including Japanese. This suggests that MLLMs could significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481. Furthermore, our model performed comparably to GPT-4o in generating cooking procedure text.

Computer Vision and Pattern Recognition, Multimedia

What problem does this paper attempt to address?

The paper attempts to address the problem of generating Japanese recipe texts using Multimodal Large Language Models (MLLMs). Specifically, it focuses on:

1. **Understanding Food Images**: The study aims to generate complete recipe texts, including ingredient lists and cooking steps, by inputting food images. This involves the model's ability to understand food images, including recognizing food types and estimating the ingredients used.
2. **Improving Generation Quality**: The paper enhances the quality of generated recipes by fine-tuning existing open-source MLLMs (such as LLaVA-1.5 and Phi-3 Vision) and training them on a Japanese recipe dataset. The research particularly focuses on the accuracy of the generated ingredient lists and the naturalness of the cooking steps.
3. **Evaluating Model Performance**: Researchers created a new 50-category evaluation scheme that covers the diversity of Japanese culinary culture and conducted comprehensive tests using 5,000 evaluation samples. Evaluation metrics include F1 score, sacreBLEU score, etc., to compare the performance of different models.
4. **Handling Non-Food Images**: Besides generating recipes, the study also explores how to enable the model to recognize and refuse to generate recipes when non-food images are input, thereby improving the model's robustness in practical applications.

Overall, the goal of the paper is to enhance the ability to generate high-quality Japanese recipes from food images using Multimodal Large Language Models and to ensure the reliability and accuracy of the model in practical applications.
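To make the ingredient F1 metric concrete, the following is a minimal sketch of how a per-recipe F1 over ingredient sets could be computed and averaged. It assumes exact string matching after whitespace stripping and a simple per-sample mean; the paper's actual normalization and averaging procedure is not described in this summary, and the function names `ingredient_f1` and `mean_ingredient_f1` are hypothetical.

```python
def ingredient_f1(predicted, reference):
    """F1 between two ingredient lists, treated as sets (exact match is an assumption)."""
    pred = {p.strip() for p in predicted if p.strip()}
    ref = {r.strip() for r in reference if r.strip()}
    true_positives = len(pred & ref)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case.
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_ingredient_f1(pairs):
    """Average the per-recipe F1 over (predicted, reference) pairs."""
    scores = [ingredient_f1(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative data only, not taken from the paper's evaluation set.
predicted = ["醤油", "砂糖", "鶏肉"]
reference = ["醤油", "みりん", "鶏肉", "砂糖"]
score = ingredient_f1(predicted, reference)  # precision 3/3, recall 3/4
```

In this example all three predicted ingredients appear in the reference, but the reference contains one more, so precision is 1.0 and recall is 0.75. In practice, Japanese ingredient names often have spelling variants (kana versus kanji, for instance), so a real evaluation would likely need a normalization step before matching.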

LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

Large Language Models for Ingredient Substitution in Food Recipes using Supervised Fine-tuning and Direct Preference Optimization

Applying Large Language Models for Automated Essay Scoring for Non-Native Japanese

The Multimodal And Modular Ai Chef: Complex Recipe Generation From Imagery

Retrieval Augmented Recipe Generation

70B-parameter large language models in Japanese medical question-answering

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

Multi-modal Cooking Workflow Construction for Food Recipes

Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese

Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models

Real-world cooking robot system from recipes based on food state recognition using foundation models and PDDL

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

LLMGA: Multimodal Large Language Model based Generation Assistant

Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

A Survey on Multimodal Large Language Models

Cook-Gen: Robust Generative Modeling of Cooking Actions from Recipes

AiGen-FoodReview: A Multimodal Dataset of Machine-Generated Restaurant Reviews and Images on Social Media