Abstract: Food image recognition is the challenging task of predicting the food category or ingredient composition of an image. It is an essential and fundamental step toward automatic dietary recognition and assessment. Many food images, especially those of Chinese food, lack distinctive structured information and fixed semantic patterns. To overcome this problem, current approaches attempt to capture more discriminative features from local or attention-based perspectives. However, ingredient composition is often overlooked, even though the relation between food category and ingredient composition may also be conducive to image recognition. Therefore, in this paper, we propose a Region-Level Attention Network (RLA-Net) for joint food and ingredient classification, which is composed of two stages. In the Feature Extraction Stage, a two-branch structure is designed to extract global food features and local-region ingredient features under the supervision of the ground-truth labels. In the Relation Fusion Stage, we propose a Region-Weighted Module (RWM) that exploits the mutual relationship between the food category and its ingredients to obtain relation fusion features for better performance. The experimental results demonstrate that our model achieves state-of-the-art performance in ingredient recognition on the Chinese food dataset VIREO Food-172, and that its food classification results are also competitive.

Region-Level Attention Network for Food and Ingredient Joint Recognition