Abstract: Visual food analysis has recently received increasing attention in the computer vision community because of its wide range of applications, e.g., dietary nutrition management, smart restaurants, and personalized diet recommendation. Because food images are unstructured, with complex and unfixed visual patterns, mining food-related semantic-aware regions is crucial. Furthermore, the ingredients in a food image are semantically related to one another through cooking habits, and they have significant semantic relationships with food categories under the hierarchical food classification ontology. Modeling the long-range semantic relationships among ingredients and the semantic interactions between categories and ingredients is therefore beneficial for ingredient recognition and food analysis. Taking these factors into consideration, we propose a multi-task learning framework for food category and ingredient recognition. The framework mainly consists of a food-oriented Transformer named Convolution-Enhanced Bi-Branch Adaptive Transformer (CBiAFormer) and a multi-task category-ingredient recognition network called Structural Learning and Cross-Task Interaction (SLCI). To capture the complex and unfixed fine-grained patterns of food images, we propose a query-aware, data-adaptive attention mechanism in CBiAFormer called Bi-Branch Adaptive Attention (BiA-Attention). It consists of a local fine-grained branch and a global coarse-grained branch, which mine local and global semantic-aware regions for different input images by adaptively assigning a set of candidate keys/values to each query. Additionally, a convolutional patch embedding module is proposed to extract the fine-grained features that are neglected by Transformers.
To fully utilize the ingredient information, we propose SLCI, which consists of cross-layer attention to model the semantic relationships between ingredients and two cross-task interaction modules to mine the semantic interactions between categories and ingredients. Extensive experiments show that our method achieves competitive performance on three mainstream food datasets (ETH Food-101, Vireo Food-172, and ISIA Food-200). Visualization analyses of CBiAFormer and SLCI on the two tasks further demonstrate the effectiveness of our method. Code and models are available at https://github.com/Liuyuxinict/CBiAFormer.
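
The idea of giving each query its own candidate key/value set can be illustrated with a small sketch. The abstract does not specify how BiA-Attention selects candidates, so the code below assumes a simple top-k rule (each query attends only to its k highest-scoring keys) purely for illustration. The function name `topk_attention`, the toy inputs, and the selection rule are all hypothetical, and the local/global branch split is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def topk_attention(queries, keys, values, k):
    """For each query, attend only to its k highest-scoring keys.

    Assumption: top-k selection stands in for the per-query candidate
    key/value assignment described in the abstract; it is not the
    paper's actual selection rule.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, chosen = [], []
    for q in queries:
        scores = [dot(q, key) * scale for key in keys]
        # Indices of the k best-matching keys for this particular query.
        idx = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        weights = softmax([scores[i] for i in idx])
        # Weighted sum over the selected values only.
        out = [sum(w * values[i][d] for w, i in zip(weights, idx))
               for d in range(len(values[0]))]
        outputs.append(out)
        chosen.append(sorted(idx))
    return outputs, chosen


# Toy data: two queries, four key/value pairs in two loose "groups".
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
outputs, chosen = topk_attention(queries, keys, values, k=2)
```

In this toy setup the two queries end up with disjoint candidate sets, which is the sense in which the attention is "query-aware": the keys a query may attend to depend on the query itself rather than on a fixed window.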

Convolution-Enhanced Bi-Branch Adaptive Transformer With Cross-Task Interaction for Food Category and Ingredient Recognition
