Towards Automated Recipe Genre Classification using Semi-Supervised Learning

Nazmus Sakib, G. M. Shahariar, Md. Mohsinul Kabir, Md. Kamrul Hasan, Hasan Mahmud
DOI: https://doi.org/10.48550/arXiv.2310.15693
2023-10-24
Computation and Language
Abstract: Sharing cooking recipes is a great way to exchange culinary ideas and provide instructions for food preparation. However, categorizing raw recipes found online into appropriate food genres can be challenging due to a lack of adequate labeled data. In this study, we present a dataset named the "Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Recipe Dataset" that contains two million culinary recipes labeled in respective categories with extended named entities extracted from recipe descriptions. This collection of data includes various features such as title, NER, directions, and extended NER, as well as nine different labels representing genres including bakery, drinks, non-veg, vegetables, fast food, cereals, meals, sides, and fusions. The proposed pipeline named 3A2M+ extends the size of the Named Entity Recognition (NER) list to address missing named entities like heat, time or process from the recipe directions using two NER extraction tools. 3A2M+ dataset provides a comprehensive solution to the various challenging recipe-related tasks, including classification, named entity recognition, and recipe generation. Furthermore, we have demonstrated traditional machine learning, deep learning and pre-trained language models to classify the recipes into their corresponding genre and achieved an overall accuracy of 98.6%. Our investigation indicates that the title feature played a more significant role in classifying the genre.
What problem does this paper attempt to address?
The paper addresses the automatic classification of online cooking recipes: categorizing a large number of raw recipes found on the internet into appropriate food genres, a task made difficult by the lack of sufficient labeled data. To tackle this, the authors created the "Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking Recipe Dataset," which contains 2 million cooking recipes annotated with genre labels and extended named entities. The authors also proposed a semi-supervised learning-based method to improve named entity recognition (NER), in particular to recover missing named entities related to temperature, time, and methods in the cooking process. With this approach, the authors improved recipe classification and provided a resource for other recipe-related tasks, such as recommendation systems, ingredient substitution, dietary analysis, and recipe summarization. The study reports an overall accuracy of 98.6% on the recipe classification task, demonstrating the effectiveness of pre-trained language models (such as DistilBERT and RoBERTa) for recipe classification.
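To make the dataset layout and the NER-extension idea described above more concrete, the following is a minimal sketch in Python. It is not the authors' pipeline: the `Recipe` class, the `extend_ner` function, and the three regular expressions are hypothetical stand-ins for the two NER extraction tools mentioned in the abstract, which are not named here. Only the feature names (title, NER, directions, extended NER) and the nine genre labels come from the source.

```python
import re
from dataclasses import dataclass
from typing import List

# The nine genre labels listed in the abstract.
GENRES = ("bakery", "drinks", "non-veg", "vegetables", "fast food",
          "cereals", "meals", "sides", "fusions")

# Hypothetical patterns (an assumption, not the paper's tools) for the
# entity types the abstract says are missing: heat, time, and process.
HEAT = re.compile(r"\b\d+\s*(?:°\s*[FC]|degrees)", re.IGNORECASE)
TIME = re.compile(r"\b\d+\s*(?:minutes?|mins?|hours?|hrs?|seconds?)\b",
                  re.IGNORECASE)
PROCESS = re.compile(r"\b(?:bake|boil|fry|stir|mix|simmer|grill|chop|whisk)\b",
                     re.IGNORECASE)


@dataclass
class Recipe:
    """One record with the features named in the abstract."""
    title: str
    directions: str
    ner: List[str]   # original named entities (e.g. ingredients)
    genre: str       # one of GENRES

    def __post_init__(self):
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre!r}")


def extend_ner(recipe: Recipe) -> List[str]:
    """Return the original NER list plus heat/time/process mentions
    found in the directions, skipping case-insensitive duplicates."""
    extended = list(recipe.ner)
    seen = {entity.lower() for entity in recipe.ner}
    for pattern in (HEAT, TIME, PROCESS):
        for match in pattern.finditer(recipe.directions):
            text = match.group(0)
            if text.lower() not in seen:
                seen.add(text.lower())
                extended.append(text)
    return extended


example = Recipe(
    title="Banana Bread",
    directions="Mix flour and bananas. Bake at 350 degrees for 45 minutes.",
    ner=["flour", "bananas"],
    genre="bakery",
)
extended_ner = extend_ner(example)
```

In this sketch the extended list keeps the original entities first and appends the newly found mentions, so the original NER feature remains recoverable as a prefix. A real extraction step would presumably use trained NER models rather than fixed keyword patterns.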