Abstract: Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). To begin, we analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. We then train a specific reward model for each static quality metric and use Proximal Policy Optimization (PPO) to train models that optimize a single quality metric at a time. Furthermore, we amalgamate these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how to reliably use RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generated higher-quality test cases than the base LLM, improving the model by up to 21%, and generated nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at https://figshare.com/s/ded476c8d4c221222849.
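To make the idea of a unified reward built from static quality metrics more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not name the seven metrics, the target language, or how rewards are weighted, so the four metrics below, the equal default weights, and the function names `static_quality_metrics` and `combined_reward` are all assumptions chosen for illustration. The abstract also describes training reward models, whereas this sketch computes the metric values directly with a static check on Python test code.

```python
import ast


def static_quality_metrics(test_source: str) -> dict:
    """Score one generated test on hypothetical binary static metrics.

    The metric names are illustrative assumptions, not the paper's list.
    """
    names = ("syntax_ok", "has_assertion",
             "no_conditional_logic", "no_duplicate_asserts")
    try:
        tree = ast.parse(test_source)
    except SyntaxError:
        # Code that does not parse fails every metric.
        return {name: 0.0 for name in names}

    nodes = list(ast.walk(tree))
    # ast.dump omits line numbers by default, so identical asserts compare equal.
    asserts = [ast.dump(n) for n in nodes if isinstance(n, ast.Assert)]
    has_branch = any(isinstance(n, (ast.If, ast.For, ast.While)) for n in nodes)

    return {
        "syntax_ok": 1.0,
        "has_assertion": 1.0 if asserts else 0.0,
        "no_conditional_logic": 0.0 if has_branch else 1.0,
        "no_duplicate_asserts": 1.0 if len(asserts) == len(set(asserts)) else 0.0,
    }


def combined_reward(test_source: str, weights: dict = None) -> float:
    """Weighted average of the metrics, in [0, 1] (assumed combination rule)."""
    metrics = static_quality_metrics(test_source)
    weights = weights or {name: 1.0 for name in metrics}
    total = sum(weights[name] for name in metrics)
    return sum(weights[name] * metrics[name] for name in metrics) / total


good = "def test_add():\n    assert 1 + 1 == 2\n"
smelly = (
    "def test_add():\n"
    "    if True:\n"
    "        assert 1 + 1 == 2\n"
    "    assert 1 + 1 == 2\n"
)
broken = "def test_add(:\n"

print(combined_reward(good), combined_reward(smelly), combined_reward(broken))
```

Under these assumptions, a scalar like this could serve as the per-sample reward passed to a PPO trainer, and giving a single metric all of the weight would correspond to optimizing one quality metric at a time.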

MLinter: Learning Coding Practices from Examples-Dream or Reality?

Code Linting using Language Models

How Beginning Programmers and Code LLMs (Mis)read Each Other

Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding

Navigating the Pitfalls: Analyzing the Behavior of LLMs as a Coding Assistant for Computer Science Students—A Systematic Review of the Literature

Leveraging Large Language Models for Automating Inductive Qualitative Coding: A Comparative Study of Prompt Engineering Techniques

Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword?

Out of style: Misadventures with LLMs and code style transfer

Using an LLM to Help With Code Understanding

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey

LLM-Assisted Code Cleaning For Training Accurate Code Generators

Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging

INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair

RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code

AI-powered Code Review with LLMs: Early Results

Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study

Multilingual training for Software Engineering

Teaching Machines to Code: Smart Contract Translation with LLMs