CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

Zhengmin Yu, Jiutian Zeng, Siyi Chen, Wenhan Xu, Dandan Xu, Xiangyu Liu, Zonghao Ying, Nan Wang, Yuan Zhang, Min Yang
2024-11-25
Abstract: Over the past year, there has been a notable rise in the use of large language models (LLMs) for academic research and industrial practices within the cybersecurity field. However, there remains a lack of comprehensive and publicly accessible benchmarks to evaluate the performance of LLMs on cybersecurity tasks. To address this gap, we introduce CS-Eval, a publicly accessible, comprehensive and bilingual LLM benchmark specifically designed for cybersecurity. CS-Eval synthesizes the research hotspots from academia and practical applications from industry, curating a diverse set of high-quality questions across 42 categories within cybersecurity, systematically organized into three cognitive levels: knowledge, ability, and application. Through an extensive evaluation of a wide range of LLMs using CS-Eval, we have uncovered valuable insights. For instance, while GPT-4 generally excels overall, other models may outperform it in certain specific subcategories. Additionally, by conducting evaluations over several months, we observed significant improvements in many LLMs' abilities to solve cybersecurity tasks. The benchmarks are now publicly available at https://github.com/CS-EVAL/CS-Eval.
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the lack of a comprehensive, publicly available benchmark for evaluating the performance of large language models (LLMs) on cybersecurity tasks. General LLM benchmarks such as MMLU and GLUE, and domain-specific benchmarks such as the comprehensive evaluation benchmarks in the financial and legal fields, often overlook the unique challenges of cybersecurity. Existing cybersecurity-specific datasets provide detailed task evaluations, but they lack the coverage needed for a thorough evaluation.

To fill this gap, the paper introduces **CS-Eval**, a publicly available, comprehensive, and bilingual LLM benchmark specifically designed for the cybersecurity field. CS-Eval integrates academic research hotspots and practical industry applications, and carefully curates a set of high-quality questions covering 42 cybersecurity categories, systematically organized into three cognitive levels: knowledge, ability, and application. Through extensive evaluations, the paper reveals some valuable insights. For example, although GPT-4 generally performs very well, other models may perform better in some specific subcategories. In addition, over several months of evaluation, the capabilities of many LLMs in solving cybersecurity tasks improved significantly.

### Main contributions:

1. **Introducing CS-Eval**: This is the first open-access, bilingual, comprehensive cybersecurity benchmark, covering a wide range of tasks and domains and providing a comprehensive and accurate evaluation of large language models (LLMs). The benchmark dataset is publicly available at [https://github.com/CS-EVAL/CS-Eval](https://github.com/CS-EVAL/CS-Eval).
2. **Addressing the challenges of creating a cybersecurity benchmark**: By aligning with industry and academic priorities, ensuring strict data quality, and enhancing practical insights, the paper provides valuable guidance for the development of benchmarks in other professional fields.
3. **Experimental results**: Through extensive experiments conducted in different time periods, the paper reaches several important findings, such as the best models for different tasks and the scaling laws exhibited in the benchmark. More importantly, the paper provides practical insights for future training of large language models in specific domains.

### Experimental setup:

- **Model selection**: The paper selects a variety of popular LLMs for evaluation, including open-source and closed-source models, with parameter scales ranging from small to large. See Table 2 for specific models.
- **Evaluation metrics**: Accuracy is the main evaluation metric for each question, with customized evaluation for different question types. For example, multiple-choice and true-false questions are scored by exact match against the specified answers; for open-ended questions, LLMs are used in combination with regular expressions to extract relevant answers and format them into a standardized JSON structure to ensure consistency.

### Experimental results:

- **Overall comparison**: GPT-4 8K performs best overall across all fields, with an average score of 87.57, demonstrating its strong performance in various cybersecurity tasks. Although the newer GPT-4o has been introduced, GPT-4 8K is still the best-performing model, which may be attributed to the fact that GPT-4o is optimized for multi-modal capabilities, speed, and efficiency, making it sometimes less focused on pure-text tasks than GPT-4 8K. Other models, such as Qwen2-72B-Instruct, also perform well, with an average score of 86.82.
- **Performance in specific domains**: Although general-purpose models such as GPT-4 perform very well overall, other models can do better in specific domains. For example, Qwen2-72B-Instruct scores 88.56 in threat detection and prevention, exceeding GPT-4's score of 85.21.

### Conclusion:

By introducing CS-Eval, the paper fills the gap left by the lack of a comprehensive evaluation benchmark in the cybersecurity field, providing a valuable tool that helps developers, users, and researchers identify the limitations of models and select the models that best suit their needs. The paper also emphasizes the importance of data quality and diversity in the training of large language models.
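
The evaluation metric described under the experimental setup (regular-expression answer extraction, a standardized JSON record, and exact-match accuracy) can be illustrated with a minimal Python sketch. This is not the CS-Eval implementation: the answer pattern, the JSON field names, and the function names `extract_answer`, `to_record`, and `exact_match_accuracy` are all assumptions made for illustration, and the LLM-assisted extraction step mentioned in the summary is left out.

```python
import json
import re

# Hypothetical answer pattern such as "Answer: B" or "Answer: A, C".
# The actual CS-Eval output format is not given in this summary.
ANSWER_RE = re.compile(
    r"answer\s*[:：]\s*([A-D](?:\s*,\s*[A-D])*)\b", re.IGNORECASE
)


def extract_answer(raw_output: str) -> str:
    """Pull choice letters out of free-form model output and normalize them."""
    match = ANSWER_RE.search(raw_output)
    if match is None:
        return ""  # nothing recognizable; scored as wrong later
    letters = re.findall(r"[A-D]", match.group(1).upper())
    return "".join(sorted(set(letters)))


def to_record(question_id: str, raw_output: str) -> str:
    """Format one extracted answer as a JSON string (field names are assumed)."""
    return json.dumps({"id": question_id, "answer": extract_answer(raw_output)})


def exact_match_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Share of gold questions whose predicted answer equals the reference exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for qid, ref in gold.items() if predictions.get(qid, "") == ref)
    return correct / len(gold)
```

Normalizing the extracted letters (upper-casing, de-duplicating, sorting) before comparison keeps the exact-match check strict about content while tolerating differences in how a model phrases or orders its choices.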