Abstract: Recent years have witnessed the rapid development of large language models (LLMs) in various domains. To better serve the large number of Chinese users, many commercial vendors in China have adopted localization strategies, training and providing local LLMs specifically customized for Chinese users. Looking ahead, one of the key applications of LLMs will be practical deployment in industrial production by enterprises and users in those sectors. However, the accuracy and robustness of LLMs in industrial scenarios have not been well studied. In this paper, we present a comprehensive empirical study of the accuracy and robustness of LLMs in Chinese industrial production. We manually collected 1,200 domain-specific problems from eight industrial sectors to evaluate LLM accuracy. We also designed a metamorphic testing framework containing four industry-specific stability categories with eight abilities, totaling 13,631 questions with variants, to evaluate LLM robustness. In total, we evaluated nine LLMs developed by Chinese vendors and four LLMs developed by global vendors. Our major findings include: (1) Current LLMs exhibit low accuracy in Chinese industrial contexts, with all LLMs scoring less than 0.6. (2) Robustness scores vary across industrial sectors, and local LLMs overall perform worse than global ones. (3) LLM robustness differs significantly across abilities. Global LLMs are more robust under logic-related variants, while advanced local LLMs perform better on problems related to understanding Chinese industrial terminology. Our results provide guidance for understanding and improving the industrial domain capabilities of LLMs from both the development and the industrial enterprise perspectives. The results further motivate possible research directions and tooling support.

Manufacturing Domain QA with Integrated Term Enhanced RAG

Leveraging error-assisted fine-tuning large language models for manufacturing excellence

An Empirical Study on Large Language Models in Accuracy and Robustness under Chinese Industrial Scenarios

Large Language Models for Manufacturing

Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Data-Efficient Massive Tool Retrieval: A Reinforcement Learning Approach for Query-Tool Alignment with Language Models

DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation

Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit

Enhancing Large Language Models' Situated Faithfulness to External Contexts

TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

Large Language Model Empowered by Domain-Specific Knowledge Base for Industrial Equipment Operation and Maintenance

Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation

REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models

Knowledge sharing in manufacturing using LLM-powered tools: user study and model benchmarking

Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering

Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations: A Case Study on Domain-Specific Queries in Private Knowledge-Bases

Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs