NutritionVerse-Direct: Exploring Deep Neural Networks for Multitask Nutrition Prediction from Food Images

Matthew Keller, Chi-en Amy Tai, Yuhao Chen, Pengcheng Xi, Alexander Wong
DOI: https://doi.org/10.48550/arXiv.2405.07814
2024-05-13
Abstract: Many aging individuals encounter challenges in effectively tracking their dietary intake, exacerbating their susceptibility to nutrition-related health complications. Self-reporting methods are often inaccurate and suffer from substantial bias; however, leveraging intelligent prediction methods can automate and enhance precision in this process. Recent work has explored using computer vision prediction systems to predict nutritional information from food images. Still, these methods are often tailored to specific situations, require other inputs in addition to a food image, or do not provide comprehensive nutritional information. This paper aims to enhance the efficacy of dietary intake estimation by leveraging various neural network architectures to directly predict a meal's nutritional content from its image. Through comprehensive experimentation and evaluation, we present NutritionVerse-Direct, a model utilizing a vision transformer base architecture with three fully connected layers that lead to five regression heads predicting calories (kcal), mass (g), protein (g), fat (g), and carbohydrates (g) present in a meal. NutritionVerse-Direct yields a combined mean average error score on the NutritionVerse-Real dataset of 412.6, an improvement of 25.5% over the Inception-ResNet model, demonstrating its potential for improving dietary intake estimation accuracy.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty many older adults have in tracking their dietary intake, a difficulty that increases their susceptibility to nutrition-related health complications. Traditional self-reporting methods are often inaccurate and substantially biased, so an automated method is needed to improve the accuracy of dietary intake estimation. The authors use deep neural networks to predict nutritional content directly from food images, and they aim to improve estimation by comparing neural network architectures. Specifically, they explore different fully connected layer structures and feature extractors (such as vision transformers and masked autoencoders) to optimize prediction performance. After experimentally evaluating several model architectures, the paper proposes NutritionVerse-Direct, a vision-transformer-based model with three fully connected layers feeding five regression heads. The model predicts the calories (kcal), mass (g), protein (g), fat (g), and carbohydrates (g) of a meal directly from a food image. On the NutritionVerse-Real dataset it achieves a combined mean absolute error (MAE) of 412.6, which is 25.5% lower than that of the Inception-ResNet model. This result indicates that the model has the potential to improve the accuracy of dietary intake estimation.
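To make the combined MAE and the relative-improvement figure concrete, the following is a minimal Python sketch. It assumes the combined score is the sum of the per-task MAEs over the five predicted quantities; the summary does not spell out the aggregation, so treat that as an assumption. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# The five quantities the model predicts for each meal.
TASKS = ("calories", "mass", "protein", "fat", "carbohydrates")


def mean_absolute_error(pred: List[float], true: List[float]) -> float:
    """Average of |prediction - ground truth| over all meals."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def combined_mae(preds: Dict[str, List[float]],
                 targets: Dict[str, List[float]]) -> float:
    """ASSUMPTION: the combined score is the sum of the per-task MAEs."""
    return sum(mean_absolute_error(preds[t], targets[t]) for t in TASKS)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage reduction in error of candidate relative to baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Hypothetical predictions and ground truth for two meals (illustrative only).
example_targets = {
    "calories": [500.0, 300.0],
    "mass": [400.0, 250.0],
    "protein": [30.0, 10.0],
    "fat": [20.0, 5.0],
    "carbohydrates": [60.0, 40.0],
}
example_preds = {
    "calories": [450.0, 330.0],
    "mass": [380.0, 260.0],
    "protein": [25.0, 12.0],
    "fat": [22.0, 4.0],
    "carbohydrates": [50.0, 45.0],
}

example_score = combined_mae(example_preds, example_targets)
```

Because the tasks use different units (kcal and g), a summed score of this kind is dominated by the quantities with the largest magnitudes, typically calories and mass. Under the same definition, a model scoring 74.5 against a baseline of 100.0 would be reported as a 25.5% improvement, since `relative_improvement(100.0, 74.5)` evaluates to 25.5.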