FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

Yuki Imajuku, Yoko Yamakata, Kiyoharu Aizawa
2024-09-27
Abstract: Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages, including Japanese. This suggests that MLLMs can be expected to significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481, indicating a higher level of accuracy. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses the problem of generating Japanese recipe text from food images using Multimodal Large Language Models (MLLMs). Specifically, it focuses on:

1. **Understanding Food Images**: The study aims to generate complete recipe text, including an ingredient list and cooking steps, from an input food image. This requires the model to understand the image, including recognizing the type of food and estimating the ingredients used.
2. **Improving Generation Quality**: The paper improves the quality of generated recipes by fine-tuning existing open MLLMs (LLaVA-1.5 and Phi-3 Vision) on a Japanese recipe dataset. The research particularly focuses on the accuracy of the generated ingredient lists and the naturalness of the cooking steps.
3. **Evaluating Model Performance**: The researchers created a new 50-category evaluation scheme that covers the diversity of Japanese culinary culture and conducted comprehensive tests using 5,000 evaluation samples. Evaluation metrics, including F1 score and sacreBLEU score, are used to compare the performance of different models.
4. **Handling Non-Food Images**: Besides generating recipes, the study also explores how to enable the model to recognize non-food input images and refuse to generate recipes for them, improving the model's robustness in practical applications.

Overall, the goal of the paper is to improve the generation of high-quality Japanese recipes from food images using MLLMs and to ensure the reliability and accuracy of the model in practical applications.
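To make the ingredient F1 metric mentioned above concrete, here is a minimal sketch of how a set-based F1 score between a generated and a reference ingredient list could be computed. The matching rule (exact string match after trimming and lowercasing) and the function names `normalize`, `ingredient_f1`, and `mean_f1` are assumptions for illustration; the text above does not specify how the paper matches ingredient names.

```python
from typing import Iterable, List, Tuple


def normalize(name: str) -> str:
    # Assumed normalization: trim whitespace and lowercase.
    # The paper's actual matching procedure may differ.
    return name.strip().lower()


def ingredient_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Set-based F1 between predicted and reference ingredient names."""
    pred = {normalize(p) for p in predicted if p.strip()}
    ref = {normalize(r) for r in reference if r.strip()}
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Average the per-sample F1 over (predicted, reference) pairs."""
    scores = [ingredient_f1(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    generated = ["rice", "egg", "soy sauce"]
    ground_truth = ["Rice", "egg", "nori", "salt"]
    # 2 of 3 predictions are correct; 2 of 4 references are recovered.
    print(f"{ingredient_f1(generated, ground_truth):.3f}")
```

In this example, precision is 2/3 and recall is 2/4, so the per-sample F1 is 4/7. Averaging per sample (rather than pooling counts over the whole evaluation set) is also an assumption of this sketch.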