Abstract: The assessment of cybersecurity Capture-The-Flag (CTF) exercises involves participants finding text strings or "flags" by exploiting system vulnerabilities. Large Language Models (LLMs) are natural-language models trained on vast amounts of words to understand and generate text; they can perform well on many CTF challenges. Such LLMs are freely available to students. In the context of CTF exercises in the classroom, this raises concerns about academic integrity. Educators must understand LLMs' capabilities to modify their teaching to accommodate generative AI assistance. This research investigates the effectiveness of LLMs, particularly in the realm of CTF challenges and questions. Here we evaluate three popular LLMs, OpenAI ChatGPT, Google Bard, and Microsoft Bing. First, we assess the LLMs' question-answering performance on five Cisco certifications with varying difficulty levels. Next, we qualitatively study the LLMs' abilities in solving CTF challenges to understand their limitations. We report on the experience of using the LLMs for seven test cases in all five types of CTF challenges. In addition, we demonstrate how jailbreak prompts can bypass and break LLMs' ethical safeguards. The paper concludes by discussing LLMs' impact on CTF exercises and its implications.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the application of large language models (LLMs) in cybersecurity, particularly for Capture-The-Flag (CTF) challenges and professional certification exams. Specifically, it attempts to answer the following questions:

1. **How do LLMs perform in professional certification exams?**
   - The researchers evaluated the ability of three popular LLMs (OpenAI's ChatGPT, Google's Bard, and Microsoft's Bing) to answer questions of varying difficulty from Cisco certification exams. These certifications range from entry-level (CCNA) to advanced (CCIE).
2. **How do LLMs perform in solving CTF challenges?**
   - The researchers assessed the LLMs' ability to solve five types of CTF challenges through seven test cases. The five types are web security, binary exploitation, cryptography, reverse engineering, and digital forensics.
3. **Security and ethical issues of LLMs**
   - The researchers also explored how "jailbreak prompts" can bypass the LLMs' safety policies, leading the models to provide information on offensive operations. This raises ethical questions about academic integrity and the use of LLMs in cybersecurity education.

### Main Research Content

1. **Performance in Professional Certification Exams**
   - ChatGPT performed well on factual multiple-choice questions (MCQs), with an accuracy of up to 82%, but poorly on conceptual questions, with an accuracy of about 50%. Bard and Bing performed comparatively worse.
2. **Ability to Solve CTF Challenges**
   - Of the seven test cases, ChatGPT solved six, Bard solved two, and Bing solved only one. Although Bing came close to the correct answer in some cases, it ultimately failed to solve the problems.
3. **Security and Ethical Issues of LLMs**
   - Using "jailbreak prompts," the researchers were able to bypass the LLMs' safety policies, leading the models to provide information on offensive operations. This indicates potential risks when LLMs are used to solve CTF challenges.

### Conclusion

- **Professional Certification Exams**: LLMs perform well on factual questions but poorly on conceptual questions that require complex reasoning.
- **CTF Challenges**: LLMs show some capability in solving CTF challenges, especially ChatGPT. However, bypassing safety policies with "jailbreak prompts" raises ethical and security concerns.
- **Future Work**: As LLMs continue to improve, future research can further explore their application in CTF competitions and classroom teaching, as well as how to ensure their safe and ethical use.

Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions

An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums

CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

On the Uses of Large Language Models to Interpret Ambiguous Cyberattack Descriptions

Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks

A Preliminary Study on Using Large Language Models in Software Pentesting

Large Language Models for Cyber Security: A Systematic Literature Review

Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Large Language Models in Cybersecurity: State-of-the-Art

Global Challenge for Safe and Secure LLMs Track 1

Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks

An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models