Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family

Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, Guilin Qi
DOI: https://doi.org/10.48550/arXiv.2303.07992
2023-09-20
Abstract: ChatGPT is a powerful large language model (LLM) that covers knowledge resources such as Wikipedia and supports natural language question answering using its own knowledge. Therefore, there is growing interest in exploring whether ChatGPT can replace traditional knowledge-based question answering (KBQA) models. Although some works have analyzed the question answering performance of ChatGPT, there is still a lack of large-scale, comprehensive testing on various types of complex questions to analyze the limitations of the model. In this paper, we present a framework that follows the black-box testing specifications of CheckList proposed by Ribeiro et al. We evaluate ChatGPT and its family of LLMs on eight real-world KB-based complex question answering datasets, which include six English datasets and two multilingual datasets. The total number of test cases is approximately 190,000. In addition to the GPT family of LLMs, we also evaluate the well-known FLAN-T5 to identify commonalities between the GPT family and other LLMs. The dataset and code are available at https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-GPT-family.git
Computation and Language
What problem does this paper attempt to address?
This paper evaluates large language models (LLMs), especially the GPT family represented by ChatGPT, on knowledge-base-based complex question answering (KB-CQA), and asks whether these models can replace traditional knowledge-based question answering (KBQA) models. Specifically, the paper focuses on the following aspects:

1. **Model performance evaluation**: Using a framework built on eight real-world KB-CQA datasets, the paper runs large-scale tests on ChatGPT and related LLMs to evaluate how well they answer different types of complex questions.
2. **Model comparison**: In addition to the GPT family, the well-known non-GPT LLM FLAN-T5 is evaluated to identify commonalities and differences between the GPT family and other LLMs.
3. **Technical progress analysis**: By comparing different versions of GPT models (GPT-3, GPT-3.5 v2, GPT-3.5 v3, ChatGPT, and GPT-4), the paper analyzes the technical progress of each model generation and the performance improvements it brings.
4. **Multilingual ability evaluation**: Besides the six English datasets, two multilingual datasets are included to evaluate model performance across languages and to explore how multilingual ability develops across model versions.
5. **Feature label evaluation**: Model performance is broken down by question features (answer type, reasoning type, and language type) to reveal the strengths and weaknesses of the models on different kinds of questions.

Through these evaluations, the paper aims to characterize the potential of existing LLMs for answering complex knowledge questions and to determine whether they can surpass or replace the current best KBQA models.
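The feature-label breakdown described above can be illustrated with a minimal Python sketch. It is not the authors' code: the field names (`labels`, `gold`, `prediction`), the label strings, and the case-insensitive substring matching rule are all assumptions made for illustration, and the paper's actual answer-matching procedure may differ.

```python
from collections import defaultdict


def accuracy_by_label(cases):
    """Compute accuracy per feature label.

    Each case is a dict with hypothetical keys:
      'labels'     - list of feature labels attached to the question
      'gold'       - list of acceptable answer strings
      'prediction' - the model's free-text answer
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        prediction = case["prediction"].lower()
        # Assumed matching rule: a case counts as correct if any gold
        # answer appears as a substring of the prediction.
        correct = any(gold.lower() in prediction for gold in case["gold"])
        for label in case["labels"]:
            totals[label] += 1
            hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Toy test cases, invented purely for illustration.
cases = [
    {"labels": ["answer:date", "reasoning:multi-hop"],
     "gold": ["1969"], "prediction": "The landing was in 1969."},
    {"labels": ["answer:date"],
     "gold": ["1789"], "prediction": "It began in 1799."},
    {"labels": ["reasoning:multi-hop", "language:de"],
     "gold": ["Berlin"], "prediction": "Berlin"},
]

scores = accuracy_by_label(cases)
for label, score in sorted(scores.items()):
    print(f"{label}: {score:.2f}")
```

Because a question can carry several labels at once, each case contributes to every label it is tagged with, so the per-label totals can sum to more than the number of test cases.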