Multimodal Food Image Classification with Large Language Models

Chee Sun Won, Nam-Ho Kim, Jun-Hwa Kim, Donghyeok Jo
DOI: https://doi.org/10.3390/electronics13224552
IF: 2.9
2024-11-20
Electronics
Abstract: In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model’s ability to extract discriminative features beyond what can be achieved with visual features alone.
Computer Science, Agricultural and Food Sciences
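
The cross-attention fusion mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify which modality supplies the queries, the feature dimensions, or whether a residual connection is used. The sketch assumes image tokens act as queries over encoded text tokens, a shared feature width of 16, and a residual add. All names (`cross_attention`, `image_tokens`, `text_tokens`, `w_q`, `w_k`, `w_v`) are hypothetical, and the random arrays stand in for real encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend to text tokens.

    Assumption: queries come from the visual branch, keys and values
    from the textual branch. The paper may arrange this differently.
    """
    q = image_tokens @ w_q                      # (n_img, d)
    k = text_tokens @ w_k                       # (n_txt, d)
    v = text_tokens @ w_v                       # (n_txt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_img, n_txt)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    fused = image_tokens + weights @ v          # residual add (assumed)
    return fused, weights


# Placeholder features; real ones would come from an image transformer
# and a text encoder applied to the LLM-generated description.
rng = np.random.default_rng(0)
d = 16
image_tokens = rng.normal(size=(49, d))   # e.g. a 7x7 grid of patch features
text_tokens = rng.normal(size=(12, d))    # e.g. 12 encoded description tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

In this arrangement the fused output keeps one vector per image token, so it could be pooled and passed to a classification head in the same way as purely visual features.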