ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models

Ahmed Heakl, Youssef Mohamed, Noran Mohamed, Aly Elsharkawy, Ahmed Zaky
2024-07-13
Abstract: The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma 1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
Computation and Language, Artificial Intelligence, Computers and Society, Machine Learning
What problem does this paper attempt to address?
This paper addresses resume classification, an important task in the online recruitment process. As more companies rely on online recruitment platforms, the efficiency and accuracy of resume classification become crucial. Existing classification models, however, are held back by small datasets, the lack of standardized resume templates, and privacy concerns, all of which degrade model performance. The paper proposes a comprehensive approach to these problems. The researchers collected a large-scale dataset of 13,389 resumes from diverse sources, covering 43 categories, reportedly the largest resume classification dataset known at the time. They used large language models such as BERT and Gemma 1.1 2B for classification and obtained significantly better results than traditional machine learning methods: the best model reaches a top-1 accuracy of 92% and a top-5 accuracy of 97.5%. The paper also emphasizes the importance of dataset quality and advanced model architectures for the accuracy and robustness of resume classification systems. The authors release their code and dataset as open source to support reproducibility and further research. By tackling data collection and preprocessing issues such as data privacy and inconsistent formats, the paper shows how these limitations of current approaches can be overcome to advance online recruitment practices.
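To make the reported metrics concrete, here is a minimal sketch of how top-1 and top-5 accuracy are conventionally computed for a multi-class classifier: a prediction counts as a top-k hit when the true category is among the k highest-scoring categories. This is not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores below are illustrative assumptions, and the toy data uses 4 categories rather than the paper's 43.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of examples whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy data (made up): 3 resumes scored against 4 hypothetical categories.
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true label 1: ranked first
    [0.50, 0.30, 0.15, 0.05],  # true label 1: ranked second
    [0.40, 0.30, 0.20, 0.10],  # true label 3: ranked last
]
labels = [1, 1, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

On this toy data only the first resume is a top-1 hit, while the first two are top-2 hits, which illustrates why top-5 accuracy (97.5% in the paper) is always at least as high as top-1 accuracy (92%).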