Improving software productivity and quality via mining source code

Tao Xie,Suresh Thummalapenta
2011-01-01
Abstract:The major goal of software development is to deliver high-quality software efficiently. To achieve this goal of delivering high-quality software efficiently, programmers often reuse existing frameworks or libraries, hereby referred to as libraries, instead of developing similar code artifacts from the scratch. However, programmers often face challenges in reusing existing libraries due to two major factors. First, many existing libraries are not well-documented. Even when such documentations exist, they are often outdated. Second, many existing libraries expose a large number of application programming interfaces (APIs), which represent interfaces through which libraries expose their functionalities. For example, the .NET base library provides nearly 10,000 API classes. Due to these two preceding factors, there exist three major problems that affect both software productivity and quality. First, programmers often spend more time in reusing existing libraries, thereby reducing software productivity. Second, programmers introduce defects while using APIs due to lack of proper knowledge on how to reuse those APIs. Third, existing white-box test generation techniques face challenges in effectively generating test inputs for the client code that reuses libraries. To address these three preceding issues, in this dissertation, we propose a general framework, called WebMiner, that uses existing open source code available on the web by leveraging a code search engine. In particular, WebMiner infers usage specifications for API methods under analysis by automatically collecting relevant code examples from the open source code available on the web. WebMiner next applies data mining techniques on those collected code examples to identify common patterns, which represent likely usage of APIs, referred to as API usage specifications. The primary reason for identifying common patterns is based on the observation that majority of the programmers correctly adhere to API usage specifications and those common patterns are likely to represent the correct usage of APIs. We further propose six approaches based on our general framework, where each approach focuses on a specific software engineering (SE) task such as detecting defects in an application under analysis. In particular, the first two approaches assist programmers in effectively reusing APIs provided by existing libraries. The next two approaches use mined API usage specifications as programming rules and detect defects in applications under analysis as deviations from the mined specifications. Finally, the last two approaches mine static and dynamic traces, respectively, for effectively generating test inputs that achieve high structural coverage of the code under test. We also propose another approach that addresses a major issue with mining-based approaches, which are not effective in scenarios where usage information is not available for the API methods under analysis or usage information is not sufficient to achieve the SE task under analysis. Our empirical results show that the approaches developed based on our WebMiner framework effectively address the respective SE tasks handled by those approaches. In particular, our empirical results demonstrate the effectiveness of expanding the data scope of mining-based approaches to large open source code available on the web. Our results also show that our approaches address queries posted in developer forums and detect new defects that are not detected by existing related approaches, thereby improving both software productivity and quality.
What problem does this paper attempt to address?