Abstract: Recent years have witnessed the rapid development of large language models (LLMs) across many domains. To better serve their large Chinese user base, many commercial vendors in China have adopted localization strategies, training and releasing LLMs customized for Chinese users. Looking ahead, a key application of LLMs will be practical deployment in industrial production by enterprises and practitioners in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study of the accuracy and robustness of LLMs in Chinese industrial production. To evaluate accuracy, we manually collected 1,200 domain-specific problems from eight industrial sectors. To evaluate robustness, we designed a metamorphic testing framework covering four industry-specific stability categories and eight abilities, for a total of 13,631 questions including variants. In total, we evaluated nine LLMs developed by Chinese vendors and four LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring below 0.6. (2) Robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities: global LLMs are more robust under logic-related variants, while advanced local LLMs perform better on problems that require understanding Chinese industrial terminology. Our results provide guidance for understanding and improving the industrial-domain capabilities of LLMs from the perspectives of both LLM developers and industrial enterprises. They also motivate possible research directions and tooling support.
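As an illustration of the metamorphic testing idea mentioned in the abstract, the sketch below scores robustness as the fraction of question variants whose answer agrees with the answer to the original (seed) question. This is a minimal sketch under stated assumptions, not the framework used in the study: the scoring rule, the names `robustness_score` and `toy_model`, and the sample questions are all hypothetical, and `toy_model` is a stand-in for a real LLM call.

```python
from typing import Callable, Dict, List


def robustness_score(model: Callable[[str], str],
                     cases: Dict[str, List[str]]) -> float:
    """Return the fraction of variants answered consistently with their seed.

    `cases` maps each seed question to a list of variants that are
    assumed to preserve the expected answer (the metamorphic relation).
    """
    total = 0
    consistent = 0
    for seed, variants in cases.items():
        seed_answer = model(seed).strip().lower()
        for variant in variants:
            total += 1
            if model(variant).strip().lower() == seed_answer:
                consistent += 1
    # Avoid division by zero when no variants are supplied.
    return consistent / total if total else 0.0


def toy_model(question: str) -> str:
    """Hypothetical stand-in for an LLM: it is thrown off by the word 'not'."""
    return "B" if "not" in question.lower() else "A"


# Hypothetical seed question with one rephrased and one negation-based variant.
cases = {
    "Which valve type is used for throttling? (A) globe (B) gate": [
        "For throttling, which valve type is used? (A) globe (B) gate",
        "Which valve type is not unsuitable for throttling? (A) globe (B) gate",
    ],
}

score = robustness_score(toy_model, cases)
print(f"robustness = {score:.2f}")
```

In this toy setup the rephrased variant keeps the answer stable while the double-negation variant flips it, so only one of the two variants is counted as consistent. A real evaluation would replace `toy_model` with calls to the model under test and would need an answer-normalization step suited to the question format.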

An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios
