Performance Evaluations of Large Language Models for Customer Service
Fei Li, Yanyan Wang, Yin Xu, Shiling Wang, Junli Liang, Zhengyi Chen, Wenrui Liu, Qiangzhong Feng, Ticheng Duan, Youzhi Huang, Qi Song, Xiangyang Li
DOI: https://doi.org/10.1007/s13042-024-02432-9
2024-01-01
International Journal of Machine Learning and Cybernetics
Abstract: Recent advances in large language models (LLMs) show broad promise for a variety of natural language processing (NLP) tasks. There is growing interest in applying LLMs to domain-specific settings such as telecommunications customer service. However, most of the benchmarks are for open-source LLMs and lack effective exploration of real scenarios. To explore the performance of LLMs for customer service, we propose the first Telecommunications Customer Service Evaluation Benchmark (TeleEval-CS). In our work, we simulate the pre-call, in-call, and post-call stages of customer service using 8.1k examples spanning 15 subtasks and 21 datasets. We build 90k domain-specific multi-task instruction samples and fine-tune three types of LLMs, covering basic NLP, knowledge Q&A, and multi-round dialogue, to inject industry knowledge. We conduct a comprehensive evaluation on TeleEval-CS across multiple scenarios, covering 34 open-source, 5 closed-source, and 4 fine-tuned LLMs with zero-shot and few-shot approaches. Experimental results show that open-source LLMs can outperform closed-source LLMs. The performance of a fine-tuned LLM depends on the quality of the fine-tuning dataset rather than its size, and fine-tuning has great potential in customer service scenarios. We provide our data and code at https://github.com/zsjslab/TeleEval-CS.