Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment

Margherita Martorana, Tobias Kuhn, Lise Stork, Jacco van Ossenbruggen
2024-09-06
Abstract: Traditional dataset retrieval systems rely on metadata for indexing, rather than on the underlying data values. However, high-quality metadata creation and enrichment often require manual annotations, which is a labour-intensive and challenging process to automate. In this study, we propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini. Our analysis focuses on classifying column headers based on domain-specific topics from the Consortium of European Social Science Data Archives (CESSDA), a Linked Data controlled vocabulary. Our approach operates in a zero-shot setting, integrating the controlled topic vocabulary directly within the input prompt. This integration serves as a Large Context Windows approach, with the aim of improving the results of the topic classification task. We evaluated the performance of the LLMs in terms of internal consistency, inter-machine alignment, and agreement with human classification. Additionally, we investigated the impact of contextual information (i.e., dataset description) on the classification outcomes. Our findings suggest that ChatGPT and GoogleGemini outperform GoogleBard in terms of internal consistency as well as LLM-human agreement. Interestingly, we found that contextual information had no significant impact on LLM performance. This work proposes a novel approach that leverages LLMs for topic classification of column headers using a controlled vocabulary, presenting a practical application of LLMs and Large Context Windows within the Semantic Web domain. This approach has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data on the Web.
Databases, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses how large language models (LLMs) can perform zero-shot topic classification to support metadata enrichment for restricted-access datasets. The researchers propose a method that embeds a controlled vocabulary (the CESSDA topic list) directly in the input prompt and use three LLMs (ChatGPT-3.5, GoogleBard, and GoogleGemini) to assign a topic to each column header. The approach aims to improve topic classification results and to promote automated metadata enrichment, thereby enhancing the findability, accessibility, interoperability, and reusability (the FAIR principles) of datasets. The main contribution is an evaluation of the three models on this task in terms of internal consistency, inter-machine alignment, and agreement with human classification. The study also examines the effect of contextual information (the dataset description) and finds that it has no significant impact on LLM performance. Overall, the work offers a practical application of LLMs to metadata enrichment for restricted-access datasets.
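The workflow described above (a prompt that carries the whole controlled vocabulary, followed by consistency and agreement checks) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the topic labels are placeholders rather than the actual CESSDA vocabulary, the prompt wording is not the authors' prompt, and the two scores are simple stand-ins, since the text above does not state which statistics the paper uses. No LLM is called; the labels are assumed to have been collected already.

```python
from collections import Counter

# Placeholder topic labels (assumption); the real controlled vocabulary is larger.
TOPICS = ["Health", "Education", "Economics", "Demography"]


def build_prompt(header, topics, description=None):
    """Assemble a zero-shot prompt that embeds the full topic list.

    The optional dataset description plays the role of the contextual
    information whose effect the study examines.
    """
    lines = ["Classify the column header into exactly one of these topics:"]
    lines += [f"- {topic}" for topic in topics]
    if description:
        lines.append(f"Dataset description: {description}")
    lines.append(f"Column header: {header}")
    lines.append("Answer with the topic name only.")
    return "\n".join(lines)


def internal_consistency(labels):
    """Share of repeated runs on one header that returned the modal label."""
    if not labels:
        raise ValueError("labels must not be empty")
    modal_count = Counter(labels).most_common(1)[0][1]
    return modal_count / len(labels)


def agreement(labels_a, labels_b):
    """Fraction of headers on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

Here `agreement` can compare two models (inter-machine alignment) or one model against human annotations, and calling `build_prompt` with and without `description` gives the two conditions needed to test the effect of context. Raw percent agreement does not correct for chance, so a chance-corrected statistic may be preferable in practice.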