HFCommunity: An Extraction Process and Relational Database to Analyze Hugging Face Hub Data

Adem Ait, Javier Luis Cánovas Izquierdo, Jordi Cabot
DOI: https://doi.org/10.1016/j.scico.2024.103079
IF: 1.039
2024-01-12
Science of Computer Programming
Abstract: Social coding platforms such as GitHub or GitLab have become the de facto standard for developing Open-Source Software (OSS) projects. With the emergence of Machine Learning (ML), platforms specifically designed for hosting and developing ML-based projects have appeared, with Hugging Face Hub (HFH) being one of the most popular. HFH aims at sharing datasets, pre-trained ML models and the applications built with them. With over 400K repositories, and growing fast, HFH is becoming a promising source of empirical data on all aspects of ML project development. However, apart from the API provided by the platform, there are no easy-to-use solutions to collect the data, nor prepackaged datasets to explore the different facets of HFH. We present HFCommunity, an extraction process for HFH data and a relational database to facilitate empirical analysis of the growing number of ML projects.
computer science, software engineering