Abstract: Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Looking ahead, one of the key future applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study on the accuracy and robustness of LLMs in the context of the Chinese industrial production area. We manually collected 1,200 domain-specific problems from 8 different industrial sectors to evaluate LLM accuracy. Furthermore, we designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, totaling 13,631 questions with variants to evaluate LLM robustness. In total, we evaluated 9 different LLMs developed by Chinese vendors, as well as 4 different LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less than 0.6. (2) The robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities. Global LLMs are more robust under logical-related variants, while advanced local LLMs perform better on problems related to understanding Chinese industrial terminology. Our study results provide valuable guidance for understanding and promoting the industrial domain capabilities of LLMs from both development and industrial enterprise perspectives. The results further motivate possible research directions and tooling support.

What problem does this paper attempt to address?

The paper examines the accuracy and robustness of large language models (LLMs) in Chinese industrial scenarios. Although LLMs perform well on general natural language processing tasks, their performance in Chinese industrial production settings has not been studied in depth. The authors therefore conducted a comprehensive empirical study to evaluate the accuracy and robustness of these models in specific industrial domains. Specifically, the paper attempts to answer the following questions:

1. How accurate are large language models in Chinese industrial scenarios?
2. How robust are large language models in different Chinese industrial scenarios?
3. How do the robustness capabilities of large language models differ across industrial domains?
4. How does the performance of local large language models vary across robustness capabilities?

To answer these questions, the authors manually collected 1,200 domain-specific questions from eight industrial sectors to evaluate the models' accuracy. They also designed a metamorphic testing framework containing four industrial-specific stability categories with eight abilities, and used it to assess robustness on 13,631 questions and their variants. The study evaluated 9 local large language models developed by Chinese vendors and 4 models developed by global vendors. The main findings of the study include:

- The accuracy of every evaluated large language model in Chinese industrial scenarios is below 60%, with significant performance differences across industrial domains.
- Global large language models perform better in logical reasoning and open-ended tasks, while local large language models have an advantage in understanding Chinese terminology.
- Robustness scores differ between industrial domains, with local models generally performing worse than global models.
- The models' performance also differs significantly across robustness capabilities: global models are more robust under logic-related variants, while advanced local models perform better at understanding Chinese industrial terminology.

These findings provide guidance for developing large language models better suited to non-English (Chinese) users in industrial applications, helping platform engineers and enterprises improve the application of local models in manufacturing.
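The two measurements described above (accuracy on original questions, robustness under metamorphic variants) can be illustrated with a minimal Python sketch. It does not reproduce the authors' framework or their scoring formula, which this summary does not give. It assumes a model is simply a function from question text to an answer string, and that every variant is a meaning-preserving rewrite whose answer should match the model's answer to the original question. The names `Problem`, `accuracy`, `robustness`, and `toy_model`, as well as the sample questions, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: a "model" is any callable that maps a question to an answer string.
Model = Callable[[str], str]


@dataclass
class Problem:
    sector: str          # hypothetical sector label, e.g. "machinery"
    question: str        # original question text
    answer: str          # reference answer, e.g. a multiple-choice letter
    variants: List[str]  # rewrites that should not change the answer


def accuracy(model: Model, problems: List[Problem]) -> float:
    """Fraction of original questions answered correctly."""
    if not problems:
        return 0.0
    correct = sum(model(p.question) == p.answer for p in problems)
    return correct / len(problems)


def robustness(model: Model, problems: List[Problem]) -> float:
    """Fraction of variants whose answer matches the model's answer on the
    original question (the metamorphic relation assumed in this sketch)."""
    total = 0
    consistent = 0
    for p in problems:
        base = model(p.question)
        for variant in p.variants:
            total += 1
            if model(variant) == base:
                consistent += 1
    return consistent / total if total else 0.0


def toy_model(question: str) -> str:
    """Stand-in for an LLM call: keys on one word, so it is easy to perturb."""
    return "B" if "torque" in question.lower() else "A"


problems = [
    Problem(
        sector="machinery",
        question="What unit is torque measured in? A) Pa B) N·m",
        answer="B",
        variants=[
            "In which unit is TORQUE expressed? A) Pa B) N·m",
            "Which unit describes rotational force? A) Pa B) N·m",
        ],
    ),
    Problem(
        sector="welding",
        question="Which gas shields a MIG weld? A) Argon B) Oxygen",
        answer="A",
        variants=["Name the shielding gas in MIG welding. A) Argon B) Oxygen"],
    ),
]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_model, problems))
    print("robustness:", robustness(toy_model, problems))
```

Comparing each variant against the model's own answer to the original, rather than against the reference answer, keeps robustness separate from accuracy: a model can be consistently wrong, or correct on the original yet unstable under rewording, as the toy model is on the second torque variant.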

An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios

Large Language Model Empowered by Domain-Specific Knowledge Base for Industrial Equipment Operation and Maintenance

Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Large Language Models for Manufacturing

Applying Large Language Models for Intelligent Industrial Automation

CMMLU: Measuring massive multitask language understanding in Chinese

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

Unlocking the Potential: Benchmarking Large Language Models in Water Engineering and Research

MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection

On the (In)Effectiveness of Large Language Models for Chinese Text Correction

Studying and Benchmarking Large Language Models For Log Level Suggestion

Insights into the Development Trends of Industrial Large Language Models

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Leveraging error-assisted fine-tuning large language models for manufacturing excellence

Large Language Models at Work in China's Labor Market

A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness

Control Industrial Automation System with Large Language Models

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis