Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data

Dehai Min, Nan Hu, Rihui Jin, Nuo Lin, Jiaoyan Chen, Yongrui Chen, Yu Li, Guilin Qi, Yun Li, Nijun Li, Qianren Wang
2024-04-09
Abstract: Augmenting Large Language Models (LLMs) for Question Answering (QA) with domain-specific data has attracted wide attention. However, domain data often exists in a hybrid format, including text and semi-structured tables, which makes it difficult to integrate the information seamlessly. Table-to-text generation is a promising solution because it transforms hybrid data into a uniformly text-formatted corpus. Although this technique has been widely studied by the NLP community, there is currently no comparative analysis of how corpora generated by different table-to-text methods affect the performance of QA systems. In this paper, we address this research gap in two steps. First, we integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data. Then, we apply this framework to real-world industrial data and conduct extensive experiments on two types of QA systems (DSFT and RAG frameworks) with four representative methods: Markdown format, Template serialization, TPLM-based method, and LLM-based method. Based on the experimental results, we draw some empirical findings and explore the underlying reasons behind the success of some methods. We hope the findings of this work will provide a valuable reference for the academic and industrial communities in developing robust QA systems.
Computer Science
What problem does this paper attempt to address?
This paper asks how the choice of table-to-text method affects the performance of large language models (LLMs) in domain-specific question-answering systems. Specifically, it examines how corpora generated by different table-to-text methods influence question-answering performance when the domain data is hybrid, containing both text and semi-structured tables. Although table-to-text generation has been widely studied, there has been no comparative analysis of how corpora generated by different methods affect domain-specific question-answering systems. The study fills this gap by integrating table-to-text generation into a framework for enhancing LLMs with domain hybrid data and by conducting extensive experiments on real industrial data. It evaluates four representative table-to-text methods (Markdown format, template serialization, TPLM-based methods, and LLM-based methods) on two types of question-answering systems (DSFT and RAG frameworks). From the experimental results, the authors derive several empirical findings and explore the reasons behind the success of certain methods, and they hope these findings can serve as a reference for academia and industry in developing robust question-answering systems.
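
To make the two simplest of the four methods concrete, the following is a minimal sketch of Markdown formatting and template serialization applied to a small table. It is an illustration only: the sample table, the function names `table_to_markdown` and `table_to_template`, and the sentence template are all assumptions made for this example, not the templates or data used in the paper. The TPLM-based and LLM-based methods require a trained model and are not sketched here.

```python
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical sample table, invented purely for illustration.
HEADERS: List[str] = ["Model", "Ports", "Power"]
ROWS: List[Row] = [
    {"Model": "X1", "Ports": "24", "Power": "150 W"},
    {"Model": "X2", "Ports": "48", "Power": "300 W"},
]


def table_to_markdown(headers: List[str], rows: List[Row]) -> str:
    """Render the table as a Markdown table, keeping its grid structure."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row[h] for h in headers) + " |")
    return "\n".join(lines)


def table_to_template(headers: List[str], rows: List[Row], key: str) -> str:
    """Turn each row into one sentence using a fixed (assumed) template.

    `key` names the column that identifies the row; every other column
    becomes a "<header> is <value>" clause.
    """
    sentences = []
    for row in rows:
        facts = ", ".join(f"{h} is {row[h]}" for h in headers if h != key)
        sentences.append(f"For {key} {row[key]}, {facts}.")
    return " ".join(sentences)


print(table_to_markdown(HEADERS, ROWS))
print(table_to_template(HEADERS, ROWS, key="Model"))
```

Either output could then be merged with the surrounding document text to form the uniform text corpus that a DSFT or RAG pipeline consumes. The Markdown form preserves the table layout, while the template form produces natural-language sentences.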