Abstract: Given the potential applications of generating recipes from food images, this area has garnered significant attention from researchers in recent years. Existing works on recipe generation primarily use a two-stage training method, first generating ingredients and then obtaining instructions from both the image and the ingredients. Large Multi-modal Models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this, we propose a retrieval-augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to select the most confident predicted recipe as the final output. It calculates the consistency among generated recipe candidates, each of which uses different retrieved recipes as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation tasks on the Recipe1M dataset.
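
The retrieval step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the image and the datastore recipes are already embedded in a shared vector space, ranks recipes by cosine similarity, and then randomly samples a few of the top-ranked recipes so that different runs see different context. The function names (`retrieve_top_k`, `sample_context`, `build_prompt`), the prompt wording, and the parameters `k` and `m` are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(image_emb, datastore, k):
    """Return the k recipe texts whose embeddings are closest to the image embedding.

    datastore is a list of (recipe_text, embedding) pairs (an assumed layout).
    """
    ranked = sorted(datastore, key=lambda item: cosine(image_emb, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def sample_context(top_k, m, rng):
    """Stochastic step: pick m of the top-k recipes at random, without replacement."""
    return rng.sample(top_k, min(m, len(top_k)))


def build_prompt(retrieved, question="Generate the ingredients and instructions for this dish."):
    """Prepend the sampled recipes to the question (hypothetical prompt format)."""
    context = "\n".join(f"Reference recipe {i + 1}: {r}" for i, r in enumerate(retrieved))
    return f"{context}\n{question}"


# Toy datastore with 2-D embeddings, for illustration only.
datastore = [
    ("pancakes", [1.0, 0.0]),
    ("waffles", [0.9, 0.1]),
    ("beef stew", [0.0, 1.0]),
    ("crepes", [0.8, 0.3]),
]
top3 = retrieve_top_k([1.0, 0.0], datastore, k=3)
prompt = build_prompt(sample_context(top3, m=2, rng=random.Random(0)))
```

Sampling from a slightly larger top-k pool, rather than always taking the single best match, is one simple way to get the "diversified" context the abstract refers to; the paper's actual sampling scheme may differ.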

### What problems does this paper attempt to solve?

This paper aims to solve several key problems in generating recipes from food images:

1. **Limitations of existing methods**:
   - **Two-stage methods**: Traditional methods usually adopt a two-stage approach. First, ingredients are predicted from the image, and then cooking instructions are generated based on the image and these ingredients. Due to limited training data and poor multimodal alignment, this approach leads to unsatisfactory results.
   - **Problems with large multimodal models (LMMs)**: Although LMMs can generate ingredients and instructions directly from images, they still suffer from hallucinations: the generated content does not match the actual image, which lowers the quality of the generated recipes.
2. **Hallucination problem**:
   - LMMs are prone to hallucinations when generating recipes; that is, the model generates ingredients or steps that do not appear in the input image. For example, some models may misidentify breadcrumbs as beef, or add non-existent tomatoes and taco seasoning to the generated recipe.
3. **Lack of context information**:
   - Existing methods make little effective use of context, so the model does not receive enough information to generate high-quality recipes.

### Solutions

To address the above problems, the authors propose a retrieval-augmented large multimodal model, which has two main innovations:

1. **Stochastic Diversified Retrieval Augmentation (SDRA)**:
   - Recipes semantically related to the input image are retrieved from an existing datastore and added to the prompt as supplementary input, giving the model more diverse and richer context for the image. This helps reduce hallucinations during generation and supplies more relevant supporting information.
2. **Self-Consistency Ensemble Voting**:
   - At inference time, multiple candidate recipes are generated, each using different retrieved recipes as context. The similarity between these candidates is computed (for example, cosine similarity or BLEU score), and the candidate most consistent with the others is selected as the final output. This mechanism can further reduce hallucinations in the generated content and improve the quality of the generated recipes.

### Experimental results

The reported experiments show that this method outperforms existing methods on the Recipe1M dataset, achieving state-of-the-art (SOTA) performance in ingredient recognition and recipe generation. The model also shows strong generalization ability and surpasses the current best baseline on ingredient recognition metrics.

Through these improvements, the authors reduce the hallucinations seen in existing recipe generation methods and improve the accuracy and consistency of the generated recipes.
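
The Self-Consistency Ensemble Voting idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it scores each candidate by its average similarity to every other candidate and returns the one with the highest score. A bag-of-words cosine similarity stands in for whatever text similarity is actually used (the summary mentions cosine similarity and BLEU), and the names `bow_cosine` and `select_most_consistent` are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between whitespace-tokenized bag-of-words counts of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def select_most_consistent(candidates, similarity=bow_cosine):
    """Return the candidate with the highest mean similarity to all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate recipe")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i):
        # Average similarity of candidate i to every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=agreement)
    return candidates[best]


# Toy candidates: two agree with each other, the third is an outlier.
candidates = [
    "mix flour eggs milk then fry",
    "mix flour eggs milk and fry the batter",
    "grill beef with taco seasoning",
]
chosen = select_most_consistent(candidates)
```

The intuition is that a candidate containing hallucinated ingredients tends to disagree with candidates generated from other retrieved contexts, so it receives a low agreement score. Any pairwise text similarity can be passed in through the `similarity` parameter.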

Retrieval Augmented Recipe Generation

MCEN: Bridging Cross-Modal Gap Between Cooking Recipes and Dish Images with Latent Variable Model

Cross-modal Recipe Retrieval with Stacked Attention Model

Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Cross-Modal Recipe Retrieval: How to Cook This Dish?

Recipe Generation from Unsegmented Cooking Videos

Learning Structural Representations for Recipe Generation and Food Retrieval

Video-based Recipe Retrieval

Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism

Cross-Modal Recipe Retrieval with Self-Attention Mechanism

R(2)GAN: Cross-modal Recipe Retrieval with Generative Adversarial Network

Cross-modal Recipe Retrieval with Rich Food Attributes

Ingredient-enriched Recipe Generation from Cooking Videos

Self-Attention and Ingredient-Attention Based Model for Recipe Retrieval from Image Queries

Deep Understanding Of Cooking Procedure For Cross-Modal Recipe Retrieval

Deep Image-to-Recipe Translation

LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

MALM: Mask Augmentation based Local Matching for Food-Recipe Retrieval

Generating Personalized Recipes from Historical User Preferences

The Multimodal And Modular Ai Chef: Complex Recipe Generation From Imagery