Phillip Schneider, Manuel Klettner, Kristiina Jokinen, Elena Simperl, Florian Matthes
Abstract: Conversational question answering systems often rely on semantic parsing to enable interactive information retrieval, which involves the generation of structured database queries from a natural language input. For information-seeking conversations about facts stored within a knowledge graph, dialogue utterances are transformed into graph queries in a process that is called knowledge-based conversational question answering. This paper evaluates the performance of large language models that have not been explicitly pre-trained on this task. Through a series of experiments on an extensive benchmark dataset, we compare models of varying sizes with different prompting techniques and identify common issue types in the generated output. Our results demonstrate that large language models are capable of generating graph queries from dialogues, with significant improvements achievable through few-shot prompting and fine-tuning techniques, especially for smaller models that exhibit lower zero-shot performance.
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform semantic parsing in conversational question answering systems that operate over knowledge graphs. Specifically, the research explores the following aspects:
1. **Understanding conversations and generating SPARQL queries**: Evaluate whether LLMs can transform natural-language conversations into structured knowledge graph queries (such as SPARQL), thereby enabling knowledge-graph-based conversational question answering.
2. **Comparison of different models and prompting techniques**: Through a series of experiments, compare LLMs of different sizes under different prompting techniques (such as zero-shot and few-shot prompting), and identify common error types in the generated output.
3. **Optimizing model performance**: Explore how few-shot prompting and fine-tuning can improve LLM performance on semantic parsing, especially for smaller models with lower zero-shot performance.
### Research Background
Conversational question answering systems usually rely on semantic parsing to convert natural-language inputs into structured database queries for interactive information retrieval. For fact lookup over knowledge graphs, dialogue utterances need to be converted into graph queries, a process known as knowledge-based conversational question answering. However, most existing work focuses on isolated utterances and ignores the broader conversational context. This study therefore pays special attention to sequences of related utterances within a conversation, ambiguous queries, and evolving search intentions.
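To make the role of context concrete, the following minimal sketch shows a follow-up question that cannot be parsed without the preceding turn. The `Turn` class, the `build_parser_input` helper, the example dialogue, and the use of Wikidata identifiers (`wd:Q42` for Douglas Adams, `wdt:P569` for date of birth) are illustrative assumptions and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user" or "system" (hypothetical labels)
    text: str


def build_parser_input(history: list[Turn], question: str, max_turns: int = 4) -> str:
    """Concatenate the most recent turns with the current question."""
    # Guard against max_turns == 0, because history[-0:] would return every turn.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [f"{turn.speaker}: {turn.text}" for turn in recent]
    lines.append(f"user: {question}")
    return "\n".join(lines)


# Toy dialogue: "he" in the follow-up only resolves through the first turn.
history = [
    Turn("user", "Where was Douglas Adams born?"),
    Turn("system", "Cambridge"),
]
follow_up = "And when was he born?"

parser_input = build_parser_input(history, follow_up)

# Illustrative target query for the follow-up (assumed Wikidata vocabulary).
target_query = "SELECT ?date WHERE { wd:Q42 wdt:P569 ?date . }"
```

Without the history, the parser would only see "And when was he born?" and would have no entity to ground `wd:Q42` on.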
### Main Contributions
- **Benchmarking study**: Four different LLMs were evaluated using automatic metrics and human evaluation, and eight common error types in the generated graph queries were identified.
- **Detailed discussion**: The effects of prompting and fine-tuning strategies on model performance in conversational question answering were explored.
- **Reproducibility**: A GitHub repository was published containing all model scripts, datasets, and evaluation outputs, so that the experimental results can be reproduced.
### Experimental Setup
- **Dataset**: The SPICE dataset was selected, which contains 197,000 conversations, each accompanied by an executable SPARQL query.
- **Model selection**: Four LLMs of different scales were compared: GPT-3.5-Turbo, LLaMA, a version of LLaMA fine-tuned with LoRA (low-rank adaptation), and Vicuna.
- **Prompting methods**: Zero-shot and few-shot prompting were used to evaluate each model's performance under different conditions.
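The difference between the two prompting methods can be sketched as follows: a zero-shot prompt contains only an instruction and the conversation, while a few-shot prompt also includes solved demonstrations. The instruction wording, the `build_prompt` helper, and the demonstration pair are hypothetical and do not reproduce the prompts used in the paper.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical instruction text, not the paper's actual prompt.
INSTRUCTION = (
    "Translate the last user question of the conversation into a SPARQL query. "
    "Return only the query."
)


def format_example(conversation: str, query: str) -> str:
    """Render one solved demonstration."""
    return f"Conversation:\n{conversation}\nSPARQL: {query}"


def build_prompt(conversation: str, examples: Sequence[tuple[str, str]] = ()) -> str:
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [INSTRUCTION]
    for demo_conversation, demo_query in examples:
        parts.append(format_example(demo_conversation, demo_query))
    # The prompt ends with an open "SPARQL:" slot for the model to complete.
    parts.append(f"Conversation:\n{conversation}\nSPARQL:")
    return "\n\n".join(parts)


# One illustrative demonstration (assumed Wikidata vocabulary).
demos = [
    (
        "user: Where was Douglas Adams born?",
        "SELECT ?place WHERE { wd:Q42 wdt:P19 ?place . }",
    )
]

zero_shot = build_prompt("user: When was Douglas Adams born?")
few_shot = build_prompt("user: When was Douglas Adams born?", demos)
```

Keeping both variants behind one function makes it easy to run the same conversations through every model with and without demonstrations.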
### Results and Discussion
The experimental results show that LLMs differ considerably in their semantic parsing ability for conversational question answering. The LoRA fine-tuned model performs well on almost all question types, especially simple questions. For complex questions (such as those requiring logical or quantitative reasoning), however, the performance of all models declines. In addition, human evaluation reveals eight common error types in the generated outputs, which provides a basis for subsequent improvements.
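A per-question-type breakdown of this kind can be computed with a few lines of code. The summary does not name the automatic metrics, so the sketch below assumes a simple whitespace-normalized exact match between generated and reference queries. The records are toy data for illustration only and are not results from the paper.

```python
from collections import defaultdict


def normalize(query: str) -> str:
    """Collapse whitespace so formatting differences do not count as errors."""
    return " ".join(query.split())


def exact_match_by_type(records):
    """Return exact-match accuracy per question type.

    `records` is an iterable of (question_type, predicted_query, gold_query).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for question_type, predicted, gold in records:
        totals[question_type] += 1
        hits[question_type] += int(normalize(predicted) == normalize(gold))
    return {question_type: hits[question_type] / totals[question_type] for question_type in totals}


# Toy records (made up): two "simple" questions and two "logical" ones.
records = [
    ("simple", "SELECT ?x WHERE { wd:Q42 wdt:P19 ?x . }", "SELECT ?x WHERE { wd:Q42 wdt:P19 ?x . }"),
    ("simple", "SELECT ?x  WHERE { wd:Q42 wdt:P569 ?x . }", "SELECT ?x WHERE { wd:Q42 wdt:P569 ?x . }"),
    ("logical", "SELECT ?x WHERE { wd:Q42 wdt:P19 ?x . }", "SELECT ?x WHERE { wd:Q42 wdt:P19 ?x . }"),
    ("logical", "SELECT ?x WHERE { wd:Q42 wdt:P19 ?x . }", "SELECT ?x WHERE { wd:Q42 wdt:P569 ?x . }"),
]

scores = exact_match_by_type(records)
```

Exact match is strict: two queries that return the same answers but differ in variable names or triple order would count as a miss, so comparing executed result sets is a common alternative.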
In conclusion, this study evaluates the semantic parsing ability of LLMs in conversational question answering and offers insights into how prompting and fine-tuning can improve these models.