Abstract: Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide); and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching the code base. We use kGym to facilitate evaluation on kBench, a crash resolution benchmark drawn from real-world Linux kernel bugs. An example bug in kBench contains crashing stack traces, a bug-reproducer file, a developer-written fix, and other associated data. To understand current performance, we conduct baseline experiments by prompting LLMs to resolve Linux kernel crashes. Our initial evaluations reveal that the best performing LLM achieves 0.72% and 5.38% in the unassisted and assisted (i.e., buggy files disclosed to the model) settings, respectively. These results highlight the need for further research to enhance model performance in SE tasks. Improving performance on kBench requires models to master new learning skills, including understanding the cause of crashes and repairing faults, writing memory-safe and hardware-aware code, and understanding concurrency. As a result, this work opens up multiple avenues of research at the intersection of machine learning and systems software.
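To make the benchmark's structure concrete, here is a minimal Python sketch of what one kBench sample and a crash-resolution check might look like. All names (`KernelBug`, `RunOutcome`, `is_resolved`, `resolution_rate`) are hypothetical illustrations, not kGym's actual API. The success criterion (the patch applies, the kernel builds, and the reproducer no longer crashes) is an assumption inferred from the abstract's description of the platform.

```python
from dataclasses import dataclass, field


@dataclass
class KernelBug:
    """One benchmark sample, following the fields listed in the abstract."""
    crash_trace: str      # crashing stack trace
    reproducer: str       # bug-reproducer file contents
    developer_fix: str    # developer-written patch (reference fix)
    metadata: dict = field(default_factory=dict)  # other associated data


@dataclass
class RunOutcome:
    """Hypothetical result of applying a candidate patch and running the reproducer."""
    patch_applied: bool
    kernel_built: bool
    crashed: bool


def is_resolved(outcome: RunOutcome) -> bool:
    # Assumed criterion: the patch applies, the kernel compiles,
    # and the reproducer no longer triggers a crash.
    return outcome.patch_applied and outcome.kernel_built and not outcome.crashed


def resolution_rate(outcomes: list[RunOutcome]) -> float:
    """Percentage of bugs whose candidate patch resolved the crash."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(is_resolved(o) for o in outcomes) / len(outcomes)
```

Under this sketch, a model's headline score is `resolution_rate` over the outcomes for every bug in the benchmark.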

What problem does this paper attempt to address?

This paper addresses how to evaluate, and ultimately improve, the performance of large language models (LLMs) at resolving Linux kernel crashes. Specifically, the authors introduce a platform named kGym and a dataset named kBenchSyz to test how well LLMs fix Linux kernel crashes.

### Problem Background

The Linux kernel is complex, critical system software with the following characteristics:

- **Multilingual**: It contains multiple programming languages, such as C, assembly, Bash, and Rust.
- **Huge in scale**: The codebase has more than 20 million lines of code spread across 50,000 files.
- **Highly concurrent**: It involves complex multithreaded operations and is prone to problems such as deadlocks and race conditions.
- **Hardware-aware**: Kernel code must be memory-safe and hardware-aware.
- **Decentralized development**: It is maintained by thousands of developers around the world, and code styles and conventions vary.

These characteristics make the Linux kernel a very challenging target, especially for automatic repair with machine-learning models.

### Research Objectives

The authors aim to evaluate existing LLMs' ability to resolve Linux kernel crashes by building the kGym platform and the kBenchSyz dataset, and to expose the deficiencies of current models, thereby providing directions for future research. Specific objectives include:

1. **Construct an evaluation platform**: Develop a platform (kGym) that automates compilation, execution, crash detection, and patch application for large-scale experiments.
2. **Create a benchmark dataset**: Extract samples from real-world Linux kernel crashes and construct a dataset (kBenchSyz) containing 279 crash-repair cases.
3. **Evaluate existing models**: Use the kGym platform to benchmark multiple LLMs on fixing Linux kernel crashes.
4. **Reveal improvement directions**: Analyze the experimental results to identify the main challenges current LLMs face in handling Linux kernel crashes and where they can improve.

### Experimental Results

In experiments on the kBenchSyz dataset, the authors found that even the best-performing LLM repairs only 0.72% of crashes in the unassisted setting and 5.38% in the assisted setting, where the buggy files are disclosed to the model. Existing LLMs therefore have substantial room for improvement on Linux kernel crashes, especially in understanding the causes of crashes, repairing faults, writing memory-safe and hardware-aware code, and handling concurrency.

### Conclusion

This research reveals the limitations of current LLMs on crash-repair tasks for complex system software such as the Linux kernel, and provides clear directions for future research. Further improving models' learning and reasoning abilities is expected to raise LLM performance in this area.
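As a quick sanity check on these figures, the reported percentages can be converted back into bug counts. The sketch below assumes that both rates are computed over all 279 cases in the dataset, which the summary implies but does not state explicitly. The function name is illustrative.

```python
TOTAL_BUGS = 279  # dataset size given in the summary


def implied_count(rate_percent: float, total: int = TOTAL_BUGS) -> int:
    """Nearest whole number of bugs corresponding to a percentage rate."""
    return round(rate_percent / 100 * total)


unassisted = implied_count(0.72)
assisted = implied_count(5.38)

print(unassisted, assisted)
# Recompute the percentages from the whole-number counts as a round-trip check.
print(f"{100 * unassisted / TOTAL_BUGS:.2f}% {100 * assisted / TOTAL_BUGS:.2f}%")
```

Under that assumption, the rates correspond to 2 and 15 resolved crashes out of 279, and recomputing the percentages from those counts gives back 0.72% and 5.38%.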

KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
