Abstract: Python has been widely used to develop large-scale software systems such as distributed systems, cloud computing, artificial intelligence, and Web platforms due to its flexibility and versatility. As complex software in its own right, a Python interpreter can also suffer from bugs, which fundamentally threaten the quality of every Python application that runs on it. Since the first release of Python, more than 30,000 bugs have been discovered. Modern interpreters often consist of many modules, built-in libraries, extensions, etc., and can reach millions of lines of code. Their large size and high complexity pose substantial challenges to quality assurance. To characterize interpreter bugs and provide empirical support, this paper conducts a large-scale empirical study on the two most popular Python interpreters, CPython and PyPy. We comprehensively investigated the maintenance logs and collected 30,069 fixed bugs and 20,334 confirmed revisions. We further manually characterized and taxonomized 1,200 bugs to investigate their representative symptoms and root causes in depth. Finally, we identified nine findings by comprehensively investigating bug locations, symptoms, root causes, and bug-revealing and bug-fixing time. The key findings include (for both interpreters): (1) the library, object model, and interpreter back-end are the most bug-prone components; (2) unexpected behavior, crash, and performance are the most common symptoms; (3) incorrect algorithm logic, configuration, and internal call are the most common general root causes, and incorrect object design is the most common Python-specific root cause; (4) some bug-triggering test programs are tiny (fewer than ten lines), and most bug fixes involve only slight modifications. Based on these findings, we discuss the lessons learned and practical implications that can support research on interpreter testing, debugging, and improvement.