Abstract: The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection process of LLMs that are suitable for facilitating Secure Coding (SC). This raises challenging research questions, such as (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How to attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments. When varying parameters and prompts, we find a 10% and 9% difference in performance, respectively. We also compare some results to reliable external actors, where our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.

What problem does this paper attempt to address?

The paper asks how to select large language models (LLMs) that are suitable for secure coding (SC), and how to ensure that their performance on automated program repair (APR) and code generation (CG) tasks meets secure coding requirements. It focuses on three research questions:

1. **RQ1**: Which features can simplify the evaluation of the secure coding capabilities of LLMs?
2. **RQ2**: What metrics should the evaluation measure?
3. **RQ3**: How can the fairness of the evaluation process be demonstrated?

To answer these questions, the authors introduce an open-source evaluation framework named LLMSecCode. The framework aims to evaluate the secure coding capabilities of LLMs objectively, and the authors validate it through experiments.

### Detailed Explanation

#### Research Background

As LLMs develop rapidly, they are increasingly applied in cybersecurity. In secure coding in particular, LLMs have the potential to find errors and propose security improvements. Selecting an LLM that is suitable for supporting secure coding is nevertheless a complex problem. The authors therefore pose three research questions and develop the LLMSecCode framework to address them.

#### Functions of the LLMSecCode Framework

- **RQ1**: To simplify the evaluation of the secure coding capabilities of LLMs, the LLMSecCode framework provides several key functions:
  - It supports adjusting model parameters (such as temperature and top-p) to observe the impact of different settings on performance.
  - It supports customizing prompts to adapt to different task requirements.
- **RQ2**: The metrics that the evaluation should measure include:
  - pass@k, that is, the probability that at least one of the first k generated code samples passes the unit tests.
  - Pass rate (proportion of fault-free solutions), that is, the ratio of the number of fault-free solutions to all evaluated solutions.
- **RQ3**: To ensure the fairness of the evaluation process, the LLMSecCode framework takes the following measures:
  - It uses the same methods and tools for every comparison.
  - It draws on a wide range of synthetic and real-world datasets.
  - It is developed as open source, so that it can undergo community review.

#### Experimental Results

The authors validate the LLMSecCode framework through experiments. Varying the parameters changes LLM performance by 10%, and varying the prompts changes it by 9%. In addition, the LLMSecCode results differ by only 5% from those of reliable external evaluations, which the authors take as evidence that the implementation is correct.

#### Contributions

The main contributions of the LLMSecCode framework are:

- It provides a general open-source framework for evaluating the capabilities of LLMs in APR, CG, and SC.
- Its effectiveness and fairness are verified through experiments.
- It offers model creators and users a unified platform to evaluate and benchmark the secure coding capabilities of LLMs.

In conclusion, the paper addresses the problem of how to select and evaluate LLMs suitable for secure coding by introducing the LLMSecCode framework, and it provides new perspectives and tools for future research.
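The two metrics listed under RQ2 can be illustrated with a short sketch. The summary does not say how LLMSecCode computes pass@k, so the code below assumes the unbiased estimator commonly used in code-generation benchmarks, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them pass the unit tests. The function names `pass_at_k` and `pass_rate` are hypothetical and are not taken from the framework.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not from the source).

    n: number of samples generated for the task
    c: number of those samples that pass the unit tests
    k: size of the sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_rate(fault_free: int, total: int) -> float:
    """Ratio of fault-free solutions to all evaluated solutions."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= fault_free <= total:
        raise ValueError("fault_free must satisfy 0 <= fault_free <= total")
    return fault_free / total
```

For a benchmark with many tasks, the per-task pass@k values would typically be averaged. Note that this estimator averages over all size-k subsets of the n samples rather than inspecting only the first k, which reduces variance when n is larger than k.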

LLMSecCode: Evaluating Large Language Models for Secure Coding

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Exploring Advanced Methodologies in Security Evaluation for LLMs

Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

CoSec: On-the-Fly Security Hardening of Code LLMs Via Supervised Co-Decoding

Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

SECURE: Benchmarking Large Language Models for Cybersecurity

CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

Software Vulnerability and Functionality Assessment using LLMs

LLM Security Guard for Code

Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Large Language Models for Code: Security Hardening and Adversarial Testing

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

SALLM: Security Assessment of Generated Code

Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code

CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models

VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching

Large Language Models for Cyber Security: A Systematic Literature Review