Recognize after early fusion: the Chinese food recognition based on the alignment of image and ingredients

Ruoxuan Zhang, Dantong Ouyang, Lili He, Lingjin Kuang, Hongtao Bai
DOI: https://doi.org/10.1007/s00530-024-01297-w
IF: 3.9
2024-03-28
Multimedia Systems
Abstract: As health concerns grow, food computing is attracting increasing research attention. A basic problem in food computing is how to extract and analyze important information about food from a picture. Food recognition, however, poses several challenges. One is that the type of a food is closely related to its ingredients. Another is that, in Chinese dietary habits, a single meal typically includes multiple dishes, yet existing food image datasets contain only single-food pictures. To address these challenges, we propose Recognize After Early Fusion (RAEF), a Chinese food recognition model based on the alignment of image and ingredients. RAEF uses a Vision Transformer (ViT) as its backbone and an early fusion model to combine visual and ingredient features. Because no suitable dataset exists for multi-label food recognition, we also introduce a new Chinese food dataset named Chinsefood-130, available at https://pan.baidu.com/s/1gpjAY3JBX_wGNuCLhxLmQQ (password: mr2b). In our experiments, RAEF performs well in both food and ingredient recognition: compared with ViT, it improves the F1 score by 10% on food recognition and by 12% on ingredient recognition.
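The results above are reported as F1 scores for a multi-label task, where each image may carry several food or ingredient labels. As a minimal sketch of how such a score can be computed, the snippet below implements micro-averaged F1 over per-image label sets. The abstract does not say which averaging scheme the authors use, so micro-averaging is an assumption here, and the function name `micro_f1` and the example labels are hypothetical.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 for multi-label predictions.

    Each element of true_sets / pred_sets is the set of labels
    for one image. Averaging scheme is an assumption, not taken
    from the paper.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels predicted and present
        fp += len(pred - truth)   # labels predicted but absent
        fn += len(truth - pred)   # labels present but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for two multi-dish images.
truth = [{"mapo tofu", "rice"}, {"dumplings"}]
pred = [{"mapo tofu"}, {"dumplings", "rice"}]
score = micro_f1(truth, pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the micro-averaged F1 is also 2/3.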
computer science, information systems, theory & methods