A Neural Model for Generating Natural Language Summaries of Program Subroutines

LeClair, Alexander; Jiang, Siyuan; McMillan, Collin

doi:10.1109/icse.2019.00087

Cited by 251 publications

(299 citation statements)

References 51 publications

Supporting

Mentioning

294

Contrasting

“…Problem Statement The problem we target in this paper is the extraction of summary descriptions from unstructured subroutine comments. By "summary descriptions," we mean a short natural language explanation of code behavior or purpose (maximum 12 words, in line with related work [10], which found that most summary descriptions consisted of fewer than 13 word tokens). By "unstructured subroutine comments," we mean the long comments immediately preceding methods in source code.…”

Section: Introduction (mentioning)

confidence: 82%

“…Referring to the process as "summarization" alludes to a history of work in Natural Language Processing of extractive summarization of documents - early attempts at code summarization involved choosing a set of n important words from code [18], [19] and then converting those words into complete sentences by placing them into sentence templates [2], [20]-[22]. A 2016 survey [23] highlights these approaches around the time that a vast majority of code summarization techniques began to be based on neural networks trained from big data input [10], [14], [24]-[27]. These NN-based approaches have proliferated, but suffer an Achilles' heel of reliance on very large, clean datasets of examples of code comments.…”

Section: A Source Code Summarization (mentioning)

confidence: 99%

“…[10], [24], [26], [27]) have explored neural-based representations of source code for the task of summarization. We integrate a recent representation described by LeClair et al [10] at ICSE'19 into the BiLSTM approach. Essentially, we augmented the encoder to accept the function source code as another input alongside the comment, but otherwise left the BiLSTM the same.…”

Section: BiLSTM+F: Summary From Comment and Source Code (mentioning)

confidence: 99%

“…Therefore, a second strategy is to automatically generate the summaries based on patterns learned from big data input. This second strategy saves significant human effort, but relies on large numbers (on the order of millions [10]) of high-quality example summaries for learning. These examples are usually extracted from metadata within large code repositories, but suitable metadata is scarce.…”

Section: Introduction (mentioning)

confidence: 99%

“…There does exist a large, untapped resource of summary descriptions in the form of unstructured header comments found in source code. Unstructured comments are much more numerous than the well-structured ones in metadata (over 3x as many in one dataset [10]), but are much longer and more expansive in scope than short summary descriptions. As we show later in this paper, these comments nearly always have a short summary description embedded in them, but the summary may occur in many locations: surrounded by different text, commented-out code, or even diagrams or logos as ASCII art.…”

Section: Introduction (mentioning)

confidence: 99%

Automatically Extracting Subroutine Summary Descriptions from Unstructured Comments

Eberhart

LeClair

McMillan

2020

2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)

Self Cite

Summary descriptions of subroutines are short (usually one-sentence) natural language explanations of a subroutine's behavior and purpose in a program. These summaries are ubiquitous in documentation, and many tools such as JavaDocs and Doxygen generate documentation built around them. And yet, extracting summaries from unstructured source code repositories remains a difficult research problem - it is very difficult to generate clean structured documentation unless the summaries are annotated by programmers. This becomes a problem in large repositories of legacy code, since it is cost-prohibitive to retroactively annotate summaries in dozens or hundreds of old programs. Likewise, it is a problem for creators of automatic documentation generation algorithms, since these algorithms usually must learn from large annotated datasets, which do not exist for many programming languages. In this paper, we present a semi-automated approach via crowdsourcing and a fully-automated approach for annotating summaries from unstructured code comments. We present experiments validating the approaches, and provide recommendations and cost estimates for automatically annotating large repositories.

PassSum: Leveraging paths of abstract syntax trees and self‐supervision for code summarization

Niu, Li, et al.

2023

J Software Evolu Process

Code summarization is to provide a high‐level comment for a code snippet that typically describes the function and intent of the given code. Recent years have seen the successful application of data‐driven code summarization. To improve the performance of the model, numerous approaches use abstract syntax trees (ASTs) to represent the structural information of the code, which is considered by most researchers to be the main factor that distinguishes code from natural language. Then, such data‐driven methods are trained on large‐scale labeled datasets to obtain a model with strong generalization capabilities that can be applied to new examples. Nevertheless, we argue that state‐of‐the‐art approaches suffer from two key weaknesses: (1) inefficient encoding of ASTs; (2) reliance on a large labeled corpus for model training. As a result, such drawbacks lead to (1) oversized model, slow training, information loss and instability; (2) inability to be applied to programming languages with only a small amount of labeled data. In light of these weaknesses, we propose PassSum, a code summarization approach that addresses the aforementioned weaknesses via (1) a novel input representation which contains an efficient AST encoding method; (2) introducing three pretraining objectives and pretraining our model with a large amount of (easy‐to‐obtain) unlabeled data under the guidance of self‐supervised learning. Experimental results on code summarization for Java, Python, and Ruby methods demonstrate the superiority of PassSum to state‐of‐the‐art methods. Further experiments demonstrate that the input representation we use has both temporal and spatial advantages in addition to performance leadership. In addition, pretraining is also shown to make the model more generalizable with less labeled data, and also to speed up the convergence of the model during training.

Effective approaches to combining lexical and syntactical information for code summarization

Zhou

Fan

2020

Softw Pract Exp

Natural language summaries of source codes are important during software development and maintenance. Recently, deep learning based models have achieved good performance on the task of automatic code summarization, which encode token sequence or abstract syntax tree (AST) of code with neural networks. However, there has been little work on the efficient combination of lexical and syntactical information of code for better summarization quality. In this paper, we propose two general and effective approaches to leveraging both types of information: a convolutional neural network that aims to better extract vector representation of AST node for downstream models; and a Switch Network that learns an adaptive weight vector to combine different code representations for summary generation. We integrate these approaches into a comprehensive code summarization model, which includes a sequential encoder for token sequence of code and a tree based encoder for its AST. We evaluate our model on a large Java dataset. The experimental results show that our model outperforms several state‐of‐the‐art models on various metrics, and the proposed approaches contribute a lot to the improvements.
