Regexes are Hard: Decision-making, Difficulties, and Risks in Programming Regular Expressions

Louis G. Michael IV, James Donohue, James C. Davis, Dongyoon Lee, Francisco Servant
2023-03-05
Abstract: Regular expressions (regexes) are a powerful mechanism for solving string-matching problems. They are supported by all modern programming languages, and have been estimated to appear in more than a third of Python and JavaScript projects. Yet existing studies have focused mostly on one aspect of regex programming: readability. We know little about how developers perceive and program regexes, nor the difficulties that they face. In this paper, we provide the first study of the regex development cycle, with a focus on (1) how developers make decisions throughout the process, (2) what difficulties they face, and (3) how aware they are about serious risks involved in programming regexes. We took a mixed-methods approach, surveying 279 professional developers from a diversity of backgrounds (including top tech firms) for a high-level perspective, and interviewing 17 developers to learn the details about the difficulties that they face and the solutions that they prefer. In brief, regexes are hard. Not only are they hard to read, our participants said that they are hard to search for, hard to validate, and hard to document. They are also hard to master: the majority of our studied developers were unaware of critical security risks that can occur when using regexes, and those who knew of the risks did not deal with them in effective manners. Our findings provide multiple implications for future work, including semantic regex search engines for regex reuse and improved input generators for regex validation.
Software Engineering
What problem does this paper attempt to address?
The paper examines how developers make decisions when programming regular expressions (regexes), what difficulties they face, and how aware they are of the associated risks. It draws on survey and interview data to build a picture of developers' mental models and practical difficulties. The core questions addressed are:

1. **How do developers perceive the value and difficulty of regular expressions?** Most developers consider regexes valuable to their work but also find them difficult to understand and use.
2. **What factors influence developers' decisions to use regular expressions in programming?** Developers weigh factors such as problem complexity and readability when deciding whether to use a regex at all, and then decide whether to write a new expression or reuse an existing one.
3. **What difficulties do developers encounter when programming with regular expressions?** The main difficulties include understanding the specific problem, interpreting regex syntax, and determining test data.
4. **How do developers cope with these difficulties?** Common strategies include studying sample inputs to find patterns, breaking the problem into smaller parts, and using tools to assist with validation.
5. **Are developers aware of the portability and security risks associated with using regular expressions?** Many developers are not fully aware of these issues, especially performance-related risks such as Regular Expression Denial of Service (ReDoS) vulnerabilities.

The study takes a mixed-methods approach, combining a quantitative survey of 279 professional developers with in-depth qualitative interviews with 17 developers. This combination gives both a high-level view of how developers experience regex programming and detailed accounts of the difficulties they face and the solutions they prefer, which the authors use to suggest directions for future research and tooling.
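To make the ReDoS risk mentioned in the summary concrete, the following is a minimal Python sketch of catastrophic backtracking. The patterns, the inputs, and the helper name `time_match` are illustrative assumptions and do not come from the paper; the example only shows how a nested quantifier can make a backtracking engine slow on a short non-matching input, while an equivalent pattern without the nesting stays fast.

```python
import re
import time

# Hypothetical patterns chosen for illustration (not from the paper).
# RISKY nests one quantifier inside another, a shape prone to
# catastrophic backtracking in backtracking engines such as Python's re.
RISKY = re.compile(r"^(a+)+$")
# SAFE accepts the same strings without the nested quantifier.
SAFE = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return (matched, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# A run of 'a' followed by one 'b' can never match either pattern, but the
# risky pattern tries many ways of splitting the run before giving up.
# The lengths are kept small so the script finishes quickly.
for n in (10, 14, 18):
    attack = "a" * n + "b"
    _, risky_seconds = time_match(RISKY, attack)
    _, safe_seconds = time_match(SAFE, attack)
    print(f"n={n:2d}  risky={risky_seconds:.5f}s  safe={safe_seconds:.5f}s")
```

On a typical machine the time for the risky pattern grows rapidly as `n` increases, while the safe pattern stays roughly flat. Exact timings depend on the interpreter and hardware, so none are shown here.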