Evaluating Code Summarization with Improved Correlation with Human Assessment.

Juanjuan Shen, Yu Zhou, Yongchao Wang, Xiang Chen, Tingting Han, Taolue Chen
DOI: https://doi.org/10.1109/qrs54544.2021.00108
2021-01-01
Abstract: Code summarization aims to automatically generate functionality descriptions of code snippets. Faithful metrics are needed to measure the degree to which machine-generated summaries capture the semantics of the code snippets. The most commonly used metrics in code summarization, such as BLEU-4, METEOR, and ROUGE-L, originate from machine translation and text summarization, and have repeatedly been found to be inconsistent with human assessment. In this paper, we propose a novel evaluation metric, Consensus-based Code Summarization Evaluation (CCSE), which assigns different semantic weights to the n-grams of the summary. We also provide an algorithm to match n-gram pairs from the reference and the candidate based on their similarity. To validate the effectiveness of the proposed metric, we collect summary pairs from two public Java datasets and calculate the correlation coefficients between CCSE and human evaluations. The experimental results show that, compared with BLEU-4, METEOR, and ROUGE-L, CCSE is more consistent with the scores assigned by human developers.
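The validation step described in the abstract, correlating metric scores with human ratings over a set of summary pairs, can be sketched roughly as follows. This is only an illustration and not the authors' implementation: the abstract does not say which correlation coefficients were used, so Pearson and Spearman are assumed here, and the function names and sample numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical data: one metric score and one human rating per summary pair.
    metric_scores = [0.12, 0.35, 0.40, 0.58, 0.71]
    human_ratings = [1, 3, 2, 4, 5]
    print("Pearson: ", round(pearson(metric_scores, human_ratings), 3))
    print("Spearman:", round(spearman(metric_scores, human_ratings), 3))
```

Under this setup, a metric is judged more consistent with human assessment when its coefficient against the human ratings is higher than that of the baselines computed on the same summary pairs.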