Abstract: As an important component of data exploration and integration, Column Type Annotation (CTA) aims to label columns of a table with one or more semantic types. With the recent development of Large Language Models (LLMs), researchers have started to explore the possibility of using LLMs for CTA, leveraging their strong zero-shot capabilities. In this paper, we build on this promising work and improve on LLM-based methods for CTA by showing how to use a Knowledge Graph (KG) to augment the context information provided to the LLM. Our approach, called RACOON, combines both pre-trained parametric and non-parametric knowledge during generation to improve LLMs' performance on CTA. Our experiments show that RACOON achieves up to a 0.21 micro F-1 improvement compared against vanilla LLM inference.
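
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of micro F1 in a multi-label setting, where each column may carry several types. The function name `micro_f1` and the example type labels are illustrative assumptions and are not taken from the paper's code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over columns; gold and pred are lists of label sets."""
    # Pool true positives, false positives and false negatives over all columns.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two hypothetical columns with illustrative type labels.
gold = [{"people.person", "sports.pro_athlete"}, {"location.location"}]
pred = [{"people.person"}, {"location.location", "location.country"}]
print(round(micro_f1(gold, pred), 3))  # prints 0.667
```

Because counts are pooled across all columns before precision and recall are computed, frequent types weigh more heavily in the score than rare ones.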

What problem does this paper attempt to address?

The paper addresses Column Type Annotation (CTA): labeling each column of a table with one or more semantic types to support data exploration and integration. Although large language models (LLMs) perform well in zero-shot settings, they still struggle with outdated knowledge, factual errors, and domain-specific queries. To address these issues, the paper proposes RACOON, a framework that improves LLM performance on CTA by incorporating knowledge from a knowledge graph (KG).

### Main Contributions:

1. **Introduction of the Problem**: Introduces the problem of leveraging external knowledge from a knowledge graph to improve LLM performance on the CTA task, and presents the end-to-end framework RACOON.
2. **Information Granularity**: Explores the different granularities of information that RACOON can retrieve from a knowledge graph, and provides post-retrieval compression and serialization methods that turn information about column cells into additional context for the prompt.
3. **Experimental Results**: Shows that RACOON consistently outperforms vanilla LLM inference across scenarios and retrieval methods, improving the micro F1 score by up to 0.21.

### Framework Overview:

- **Retriever**: Parses the input table and retrieves, from the knowledge graph, the entities related to column cells together with their neighboring information.
- **Processor**: Compresses and refines the retrieved information so that it is relevant and concise.
- **Augmentor**: Serializes the compressed information into natural language and inserts it into the original prompt to form the final KG-augmented prompt.

### Experimental Setup:

- **Dataset**: Uses the WikiTables-TURL-CTA benchmark, which contains 13,025 columns annotated in a multi-label setting with a label set of 255 Freebase types.
- **Baseline Methods**: Compared against vanilla LLM inference.
- **Evaluation Metrics**: Evaluated using the micro F1 score.

### Experimental Results:

- **Perfect Entity Linker**: Assuming a perfect entity linker, RACOON prompts enhanced with ENTITY-LABELS or ENTITY-TRIPLETS outperform vanilla LLM inference, with ENTITY-TRIPLETS performing best.
- **Different Entity Linkers**: When evaluated with the MediaWiki API and with the state-of-the-art entity linking model ReFinED, RACOON still outperforms the baseline, although performance drops, especially when the initial entity linking is incorrect.

### Conclusion:

RACOON improves performance on the CTA task by combining knowledge graphs with LLMs. The framework performs well in both multi-label and single-label settings, particularly when using advanced entity linking models.
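
The Retriever, Processor, and Augmentor stages from the framework overview can be illustrated with a small sketch. This is not the authors' implementation: the in-memory `TOY_KG`, the exact-match entity linking, the predicate-frequency compression heuristic, and all function names are assumptions chosen only to show how the three stages fit together.

```python
from collections import Counter

# Hypothetical toy knowledge graph: entity label -> list of (predicate, object).
TOY_KG = {
    "Lionel Messi": [("occupation", "football player"), ("country", "Argentina")],
    "Serena Williams": [("occupation", "tennis player"), ("country", "United States")],
    "Usain Bolt": [("occupation", "sprinter"), ("country", "Jamaica")],
}


def retrieve(column_cells, kg):
    """Retriever: link cells to entities (exact label match here) and fetch triples."""
    # A real system would call an entity linker; exact matching is a stand-in.
    return {cell: kg[cell] for cell in column_cells if cell in kg}


def process(linked, max_triples_per_entity=1):
    """Processor: keep only the triples whose predicates are most common in the column."""
    predicate_counts = Counter(
        pred for triples in linked.values() for pred, _ in triples
    )
    compressed = {}
    for entity, triples in linked.items():
        # Stable sort: ties keep their original order in the knowledge graph.
        ranked = sorted(triples, key=lambda t: -predicate_counts[t[0]])
        compressed[entity] = ranked[:max_triples_per_entity]
    return compressed


def augment(base_prompt, compressed):
    """Augmentor: serialize triples into sentences and append them to the prompt."""
    sentences = [
        f"{entity} has {pred} {obj}."
        for entity, triples in compressed.items()
        for pred, obj in triples
    ]
    if not sentences:
        return base_prompt  # nothing retrieved: fall back to the original prompt
    return base_prompt + "\nContext from the knowledge graph:\n" + "\n".join(sentences)


cells = ["Lionel Messi", "Serena Williams", "Unknown Person"]
base = "Column values: " + ", ".join(cells) + "\nAnnotate this column with semantic types."
prompt = augment(base, process(retrieve(cells, TOY_KG)))
print(prompt)
```

In this sketch, cells that cannot be linked simply contribute no context, which mirrors the summary's observation that performance depends on the quality of entity linking.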

RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph
