Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021)
DOI: 10.1145/3468264.3468588
Reassessing automatic evaluation metrics for code summarization tasks

Cited by 50 publications (39 citation statements). References 47 publications.
“…We also emphasize that even the minor improvements provided here by multilingual training (which is broadly compatible with a range of settings) constitute a relevant and potentially widely useful result. Roy et al [58] have previously noted that small gains in BLEU-4 may not be perceptible to humans as increased text quality; nevertheless, we note that natural language translation (which is now widely used) attained high performance levels based on decades of incremental progress; this result and others below provide evidence that multilingual training could be an important step in the progress towards more useful automated tools. Finally, we note that BLEU-4 gains are higher for low-resource languages (e.g., 17.7% for Ruby) and lower for high-resource languages (e.g., 2.5% for Python), as expected.…”
Section: Code Summarization (supporting)
confidence: 45%
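The statement above turns on how small a BLEU-4 gain actually is at the level of individual summaries. As an illustration only (this is a simplified stdlib sketch with add-one smoothing, not the implementation used by Roy et al [58] or by the citing paper; the function name `sentence_bleu4` is hypothetical), a sentence-level BLEU-4 can be computed as the geometric mean of smoothed 1- to 4-gram precisions times a brevity penalty:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu4(reference, hypothesis):
    """Sentence-level BLEU-4 with add-one smoothing on n-gram precisions.

    `reference` and `hypothesis` are token lists. Illustrative sketch only;
    real toolkits differ in smoothing and clipping details.
    """
    if not hypothesis:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # clipped overlap: each hypothesis n-gram is credited at most as
        # many times as it occurs in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec_sum += math.log((overlap + 1) / (total + 1))
    # brevity penalty for hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_prec_sum / 4)
```

For example, a six-token summary that differs from the reference in a single word still scores around 0.8 here, which gives a concrete sense of how a few-percent corpus-level BLEU-4 shift maps to per-summary differences readers may not notice.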
“…We fine-tune with an NVIDIA TITAN RTX, while Feng et al [18] use an NVIDIA Tesla V100. (2) We use a pairwise two-sample statistical test (as described in [58]; it is more precise than simply comparing test-set summary statistics) to gauge differences. This requires a performance measurement for each test sample, which the repository did not include.…”
Section: Code Summarization (mentioning)
confidence: 99%
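The pairwise test described above needs one score per model per test sample, with the two models' scores paired on the same sample. As one concrete instance (a sketch only; [58] does not necessarily use this exact procedure, and `paired_permutation_test` is a hypothetical name), a paired sign-flipping permutation test can be run on the per-sample score differences:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired two-sample permutation test on per-sample metric scores.

    `scores_a[i]` and `scores_b[i]` are the two models' scores on the same
    test sample; this pairing is what makes the test sharper than comparing
    aggregate summary statistics. Returns a two-sided p-value for the null
    hypothesis that both models have the same expected score.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # under the null, the sign of each per-sample difference is
        # exchangeable, so flip each sign with probability 1/2
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) / len(diffs) >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_resamples + 1)
```

Identical score lists yield a p-value of 1.0, while a large, consistent per-sample gap yields a very small p-value; the same pairing idea underlies the Wilcoxon signed-rank and paired t-tests.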