Abstract: Image-recipe cross-modal retrieval, which retrieves relevant recipes from food images and vice versa, is attracting widespread attention. The task poses two main challenges. First, the components of a recipe (words in a sentence, sentences in an entity, and entities in a recipe) differ in importance. If all components are weighted equally, the recipe embedding cannot focus on the important ones, so they contribute less to retrieval than they should. Second, food images are strongly local: only the regions that contain food matter, and enhancing the discriminative features of these local regions remains difficult. To address both problems, we propose a novel framework named Dual Cross Attention Encoders for Cross-modal Food Retrieval (DCA-Food). The framework consists of a hierarchical cross attention recipe encoder (HCARE) and a cross attention image encoder (CAIE). HCARE uses three types of cross attention modules to capture the important words in a sentence, the important sentences in an entity, and the important entities in a recipe, respectively. CAIE extracts global and local region features, then computes cross attention between them to enhance the discriminative local features in food images. We conduct ablation studies to validate our design choices. Our approach outperforms existing approaches by a large margin on the Recipe1M dataset. Specifically, we improve R@1 by +2.7 and +1.9 on the 1k and 10k test sets, respectively.
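The idea of weighting local features by their relevance to a global feature can be illustrated with a minimal sketch of scaled dot-product cross attention. This is not the authors' HCARE or CAIE implementation: the function names `softmax` and `cross_attention_pool`, the toy feature vectors, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(query, keys):
    """Score each key vector against the query with a scaled dot product,
    then return (attention weights, attention-weighted sum of the keys).

    Assumption: no learned projections are applied; a trained model would
    first map query and keys through trainable matrices.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    pooled = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(d)]
    return weights, pooled


# Hypothetical toy data: one global image feature and three local region features.
global_feat = [1.0, 0.0, 1.0, 0.0]
local_feats = [
    [0.9, 0.1, 1.1, 0.0],  # region that resembles the global feature
    [0.0, 1.0, 0.0, 1.0],  # region unrelated to the global feature
    [0.5, 0.5, 0.5, 0.5],  # mixed region
]

weights, enhanced = cross_attention_pool(global_feat, local_feats)
print(weights)   # the first region receives the largest weight
print(enhanced)  # pooled local feature, dominated by the first region
```

In this sketch the global feature acts as the query and the local regions act as keys and values, so regions that agree with the global feature dominate the pooled output. The same pattern could, under similar assumptions, weight words within a sentence or sentences within an entity.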

Dividing and Conquering Cross-Modal Recipe Retrieval: from Nearest Neighbours Baselines to SoTA

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Transformer Decoders with MultiModal Regularization for Cross-Modal Food Retrieval

Cross-Modal Recipe Retrieval with Self-Attention Mechanism

MCEN: Bridging Cross-Modal Gap Between Cooking Recipes and Dish Images with Latent Variable Model

Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning

Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism

Cross-modal Recipe Retrieval with Stacked Attention Model

Deep Understanding Of Cooking Procedure For Cross-Modal Recipe Retrieval

Cross-Modal Recipe Retrieval: How to Cook This Dish?

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks

Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders

Cross-domain Cross-modal Food Transfer

Cross-modal Recipe Retrieval with Rich Food Attributes

Self-Attention and Ingredient-Attention Based Model for Recipe Retrieval from Image Queries

Learning From Web Recipe-Image Pairs for Food Recognition: Problem, Baselines and Performance

Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning

Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Cross-domain Food Image-to-Recipe Retrieval by Weighted Adversarial Learning