Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment

Margherita Martorana, Tobias Kuhn, Lise Stork, Jacco van Ossenbruggen

2024-09-06

Abstract: Traditional dataset retrieval systems rely on metadata for indexing, rather than on the underlying data values. However, high-quality metadata creation and enrichment often require manual annotations, which is a labour-intensive and challenging process to automate. In this study, we propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini. Our analysis focuses on classifying column headers based on domain-specific topics from the Consortium of European Social Science Data Archives (CESSDA), a Linked Data controlled vocabulary. Our approach operates in a zero-shot setting, integrating the controlled topic vocabulary directly within the input prompt. This integration serves as a Large Context Windows approach, with the aim of improving the results of the topic classification task. We evaluated the performance of the LLMs in terms of internal consistency, inter-machine alignment, and agreement with human classification. Additionally, we investigate the impact of contextual information (i.e., dataset description) on the classification outcomes. Our findings suggest that ChatGPT and GoogleGemini outperform GoogleBard in terms of internal consistency as well as LLM-human-agreement. Interestingly, we found that contextual information had no significant impact on LLM performance. This work proposes a novel approach that leverages LLMs for topic classification of column headers using a controlled vocabulary, presenting a practical application of LLMs and Large Context Windows within the Semantic Web domain. This approach has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability, and Reusability (FAIR) of research data on the Web.

Databases, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper addresses how large language models (LLMs) can perform zero-shot topic classification to support metadata enrichment for restricted access datasets. The researchers integrate a controlled vocabulary directly into the input prompt and use three LLMs (ChatGPT-3.5, GoogleBard, and GoogleGemini) to classify column headers by topic. The approach aims to improve topic classification results and to support automated metadata enrichment, thereby enhancing the findability, accessibility, interoperability, and reusability (FAIR) of datasets. The main contribution is an evaluation of the three models on this task in terms of internal consistency, inter-machine alignment, and agreement with human classification. The study also examines the effect of contextual information (the dataset description) on the classification results and finds that it does not significantly affect LLM performance. The work offers a practical basis for using LLMs for metadata enrichment of restricted access datasets.
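To make the setup more concrete, here is a minimal Python sketch of how a controlled vocabulary might be embedded in a zero-shot prompt and how repeated runs might be scored. It is an illustration under stated assumptions, not the authors' implementation: the topic list is a short placeholder rather than the actual CESSDA vocabulary, the prompt wording is invented, and the function names `build_prompt`, `internal_consistency`, and `agreement` are hypothetical. The paper's own consistency and agreement measures may be defined differently.

```python
from collections import Counter
from typing import Optional, Sequence

# Illustrative placeholder topics (assumption), NOT the actual CESSDA vocabulary.
TOPICS = ["Demography", "Economics", "Education", "Health", "Politics"]


def build_prompt(
    header: str,
    topics: Sequence[str],
    description: Optional[str] = None,
) -> str:
    """Assemble a zero-shot prompt that embeds the full topic list.

    The optional dataset description stands in for the "contextual
    information" condition; the wording of the prompt is an assumption.
    """
    lines = [
        "Classify the column header into exactly one of the topics below.",
        "Answer with the topic name only.",
        "",
        "Topics:",
    ]
    lines += [f"- {topic}" for topic in topics]
    if description:
        lines += ["", f"Dataset description: {description}"]
    lines += ["", f"Column header: {header}"]
    return "\n".join(lines)


def internal_consistency(runs: Sequence[str]) -> float:
    """Share of repeated runs that return the most frequent label."""
    if not runs:
        raise ValueError("runs must not be empty")
    _, count = Counter(runs).most_common(1)[0]
    return count / len(runs)


def agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of positions where two label sequences agree.

    Usable for model-vs-model or model-vs-human comparison.
    """
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

The prompt string would be sent to each model several times; the returned labels can then be passed to `internal_consistency` (one header, many runs) or to `agreement` (many headers, two annotators). Simple percentage agreement is used here only for brevity; a chance-corrected statistic may be more appropriate in practice.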

Column Vocabulary Association (CVA): semantic interpretation of dataless tables

Zero-Shot Clinical Acronym Expansion via Latent Meaning Cells

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models

A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports

Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models

Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Zero-Shot Learning Over Large Output Spaces : Utilizing Indirect Knowledge Extraction from Large Language Models

Can Large Language Models Transform Computational Social Science?

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

Exploring Large Language Models for Feature Selection: A Data-centric Perspective

Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs

Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue

Expansive data, extensive model: Investigating discussion topics around LLM through unsupervised machine learning in academic papers and news

Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification

Evaluating LLMs on Entity Disambiguation in Tables