An Empirical Study on Focal Methods in Deep-Learning-Based Approaches for Assertion Generation

Yibo He,Jiaming Huang,Hao Yu,Tao Xie
DOI: https://doi.org/10.1145/3660785
2024-01-01
Abstract:Unit testing is widely recognized as an essential aspect of the software development process. Generating high-quality assertions automatically is one of the most important and challenging problems in automatic unit test generation. To generate high-quality assertions, deep-learning-based approaches have been proposed in recent years. For state-of-the-art d eep- l earning-based approaches for a ssertion g eneration (DLAGs), the focal method (i.e., the main method under test) for a unit test case plays an important role of being a required part of the input to these approaches. To use DLAGs in practice, there are two main ways to provide a focal method for these approaches: (1) manually providing a developer-intended focal method or (2) identifying a likely focal method from the given test prefix (i.e., complete unit test code excluding assertions) with test-to-code traceability techniques. However, the state-of-the-art DLAGs are all evaluated on the ATLAS dataset, where the focal method for a test case is assumed as the last non-JUnit-API method invoked in the complete unit test code (i.e., code from both the test prefix and assertion portion). There exist two issues of the existing empirical evaluations of DLAGs, causing inaccurate assessment of DLAGs toward adoption in practice. First, it is unclear whether the last method call before assertions (LCBA) technique can accurately reflect developer-intended focal methods. Second, when applying DLAGs in practice, the assertion portion of a unit test is not available as a part of the input to DLAGs (actually being the output of DLAGs); thus, the assumption made by the ATLAS dataset does not hold in practical scenarios of applying DLAGs. To address the first issue, we conduct a study of seven test-to-code traceability techniques in the scenario of assertion generation. We find that the LCBA technique is not the best among the seven techniques and can accurately identify focal methods with only 43.38% precision and 38.42% recall; thus, the LCBA technique cannot accurately reflect developer-intended focal methods, raising a concern on using the ATLAS dataset for evaluation. To address the second issue along with the concern raised by the preceding finding, we apply all seven test-to-code traceability techniques , respectively, to identify focal methods automatically from only test prefixes and construct a new dataset named ATLAS+ by replacing the existing focal methods in the ATLAS dataset with the focal methods identified by the seven traceability techniques, respectively. On a test set from new ATLAS+, we evaluate four state-of-the-art DLAGs trained on a training set from the ATLAS dataset. We find that all of the four DLAGs achieve lower accuracy on a test set in ATLAS+ than the corresponding test set in the ATLAS dataset, indicating that DLAGs should be (re)evaluated with a test set in ATLAS+, which better reflects practical scenarios of providing focal methods than the ATLAS dataset. In addition, we evaluate state-of-the-art DLAGs trained on training sets in ATLAS+. We find that using training sets in ATLAS+ helps effectively improve the accuracy of the ATLAS approach and T5 approach over these approaches trained using the corresponding training set from the ATLAS dataset.
What problem does this paper attempt to address?