An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios

Zongjie Li, Wenying Qiu, Pingchuan Ma, Yichen Li, You Li, Sijia He, Baozheng Jiang, Shuai Wang, Weixi Gu
2024-01-27
Abstract: Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Furthermore, looking ahead, one of the key future applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study on the accuracy and robustness of LLMs in the context of the Chinese industrial production area. We manually collected 1,200 domain-specific problems from 8 different industrial sectors to evaluate LLM accuracy. Furthermore, we designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, totaling 13,631 questions with variants to evaluate LLM robustness. In total, we evaluated 9 different LLMs developed by Chinese vendors, as well as four different LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less than 0.6. (2) The robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities. Global LLMs are more robust under logical-related variants, while advanced local LLMs perform better on problems related to understanding Chinese industrial terminology. Our study results provide valuable guidance for understanding and promoting the industrial domain capabilities of LLMs from both development and industrial enterprise perspectives. The results further motivate possible research directions and tooling support.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper examines how accurate and how robust large language models (LLMs) are in Chinese industrial scenarios. Although LLMs perform well on general natural language processing tasks, their behavior in Chinese industrial production settings has not been studied in depth. The authors therefore conducted an empirical study of accuracy and robustness in specific industrial domains.
Specifically, the paper attempts to answer the following questions:
1. How accurate are large language models in Chinese industrial scenarios?
2. How robust are large language models in different Chinese industrial scenarios?
3. How do the robustness capabilities of large language models differ across industrial domains?
4. How does the performance of local large language models vary across robustness capabilities?
To answer these questions, the authors manually collected 1,200 domain-specific questions from eight industrial sectors to evaluate accuracy. They also designed a metamorphic testing framework containing four industry-specific stability categories with eight abilities, totaling 13,631 questions with variants, to assess robustness. The study evaluated 9 local LLMs developed by Chinese vendors and 4 LLMs developed by global vendors.
The main findings of the study include:
- The accuracy of every evaluated large language model in Chinese industrial scenarios is below 60%, with significant performance differences across industrial domains.
- Global large language models perform better on logical reasoning and open-ended tasks, while local large language models have an advantage in understanding Chinese terminology.
- Robustness scores differ between industrial domains, with local models generally performing worse than global models.
- Model performance also differs significantly across robustness capabilities: global models are more robust under logic-related variants, while advanced local models perform better at understanding Chinese industrial terminology.
These findings provide guidance for developing large language models better suited to non-English (Chinese) users in industrial applications, and can help platform engineers and enterprises improve how local models are applied in manufacturing.
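The two measurements described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the text here does not give the paper's exact scoring rules, so the sketch assumes that accuracy is exact-match against a reference answer and that robustness is the fraction of variants whose answer agrees with the answer to the original (seed) question, a common metamorphic relation. The names `accuracy`, `robustness`, `toy_model`, `QUESTIONS`, and `SEEDS`, as well as the sample questions, are hypothetical.

```python
from typing import Callable, Dict, List

# A "model" is any callable mapping a prompt string to an answer string.
Model = Callable[[str], str]


def accuracy(model: Model, questions: List[Dict[str, str]]) -> float:
    """Fraction of questions answered with exactly the reference answer.

    Assumption: exact string match; the paper's scoring may differ.
    """
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions if model(q["prompt"]).strip() == q["answer"]
    )
    return correct / len(questions)


def robustness(model: Model, seeds: List[dict]) -> float:
    """Fraction of variants whose answer matches the seed question's answer.

    Assumed metamorphic relation: a meaning-preserving variant of a
    question should receive the same answer as the original question.
    """
    consistent = 0
    total = 0
    for seed in seeds:
        base_answer = model(seed["prompt"]).strip()
        for variant in seed["variants"]:
            total += 1
            if model(variant).strip() == base_answer:
                consistent += 1
    return consistent / total if total else 0.0


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: a brittle keyword rule, for illustration only."""
    return "B" if "torque" in prompt.lower() else "A"


# Hypothetical multiple-choice items, not taken from the paper's dataset.
QUESTIONS = [
    {"prompt": "Which unit measures torque? A. Pascal B. Newton-metre",
     "answer": "B"},
    {"prompt": "Which process joins metals by melting? A. Welding B. Milling",
     "answer": "A"},
    {"prompt": "Which tool measures torque on a bolt? A. Torque wrench B. Caliper",
     "answer": "A"},
]

SEEDS = [
    {"prompt": "Which unit measures torque? A. Pascal B. Newton-metre",
     "variants": [
         "Which unit is used to measure torque? A. Pascal B. Newton-metre",
         "Which unit measures rotational force? A. Pascal B. Newton-metre",
     ]},
    {"prompt": "Which process joins metals by melting? A. Welding B. Milling",
     "variants": [
         "Which process fuses metals by melting them? A. Welding B. Milling",
     ]},
]

if __name__ == "__main__":
    print(f"accuracy:   {accuracy(toy_model, QUESTIONS):.3f}")
    print(f"robustness: {robustness(toy_model, SEEDS):.3f}")
```

In this sketch the robustness score does not depend on reference answers: it only checks whether the model stays consistent when a question is rephrased, which is why a model can be consistent yet inaccurate, or accurate yet brittle.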