Abstract: Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, because they are pretrained on massive, uncritically filtered data, code LLMs have been shown in prior studies to be prone to generating code with potential vulnerabilities. Existing approaches to mitigating this risk involve crafting vulnerability-free data and subsequently retraining or fine-tuning the model. Once the number of parameters exceeds a billion, the computation and data demands of these approaches become enormous. Moreover, an increasing number of code LLMs are distributed as services, where the internal representation is not accessible and the API is the only way to reach the LLM, making the prior mitigation strategies inapplicable. To cope with this, we propose CoSec, an on-the-fly Security hardening method for code LLMs based on security model-guided Co-decoding, which reduces the likelihood that code LLMs generate code containing vulnerabilities. Our key idea is to train a separate but much smaller security model to co-decode with a target code LLM. Since the trained security model has higher confidence for secure tokens, it guides the target base model towards more secure code generation. By adjusting the probability distributions of tokens at each step of the decoding process, our approach effectively influences the generation tendencies without accessing the internal parameters of the target code LLM. We have conducted extensive experiments across various parameters in multiple code LLMs (i.e., CodeGen, StarCoder, and DeepSeek-Coder), and the results show that our approach is effective in security hardening. Specifically, our approach improves the average security ratio of six base models by 5.02%-37.14%, while maintaining the functional correctness of the target model.
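
To make the co-decoding idea concrete, the following is a minimal sketch of one way two next-token distributions could be combined at each decoding step. It is not the paper's formula: the abstract does not state the combination rule, so the linear mix with weight `alpha`, the function names (`co_decode_step`, `greedy_co_decode`), and the toy models and three-token vocabulary are all assumptions made purely for illustration. The sketch only relies on what the abstract describes, namely that each model exposes a probability distribution over tokens and that no internal parameters of the target model are touched.

```python
# Illustrative sketch only: the mixing rule and all names are assumptions,
# not the method described in the paper.

VOCAB = ["strcpy", "strncpy", "<eos>"]  # hypothetical toy vocabulary


def co_decode_step(base_probs, sec_probs, alpha=0.5):
    """Mix two next-token distributions defined over the same vocabulary.

    alpha = 0 keeps the base model's distribution unchanged;
    larger alpha gives the security model more influence.
    """
    if len(base_probs) != len(sec_probs):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    mixed = [(1 - alpha) * b + alpha * s for b, s in zip(base_probs, sec_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to guard against drift


def greedy_co_decode(base_model, sec_model, prompt, alpha=0.5, max_tokens=8):
    """Greedy decoding where every step uses the mixed distribution."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = co_decode_step(base_model(tokens), sec_model(tokens), alpha)
        best = max(range(len(VOCAB)), key=probs.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        tokens.append(VOCAB[best])
    return tokens


def toy_base_model(tokens):
    """Stand-in for the target code LLM: prefers the unsafe call."""
    if tokens[-1] in ("strcpy", "strncpy"):
        return [0.05, 0.05, 0.90]
    return [0.60, 0.30, 0.10]


def toy_security_model(tokens):
    """Stand-in for the small security model: prefers the bounded call."""
    if tokens[-1] in ("strcpy", "strncpy"):
        return [0.05, 0.05, 0.90]
    return [0.10, 0.80, 0.10]


if __name__ == "__main__":
    prompt = ["copy_with"]
    print(greedy_co_decode(toy_base_model, toy_security_model, prompt, alpha=0.0))
    print(greedy_co_decode(toy_base_model, toy_security_model, prompt, alpha=0.5))
```

With `alpha=0.0` the toy base model's preference wins and the unsafe token is chosen; with `alpha=0.5` the security model's higher confidence in the bounded call shifts the greedy choice. In a real setting the two callables would wrap whatever token-probability interface the served model and the security model expose, and both would need to share a vocabulary, which this sketch simply assumes.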

From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting

Prompting Techniques for Secure Code Generation: A Systematic Investigation

PromSec: Prompt Optimization for Secure Generation of Functional Source Code with Large Language Models (LLMs)

Demo: SGCode: A Flexible Prompt-Optimizing System for Secure Generation of Code

SecCoder: Towards Generalizable and Robust Secure Code Generation

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

"You still have to study" -- On the Security of LLM generated code

Ocassionally Secure: A Comparative Analysis of Code Generation Assistants

DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions

Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

CoSec: On-the-Fly Security Hardening of Code LLMs Via Supervised Co-Decoding

Generate and Pray: Using SALLMS to Evaluate the Security of LLM Generated Code

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

ProSec: Fortifying Code LLMs with Proactive Security Alignment

LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward

SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI

Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation

An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation

An Empirical Study of the Code Generation of Safety-Critical Software Using LLMs

Codexity: Secure AI-assisted Code Generation