On the generation, structure, and semantics of grammar patterns in source code identifiers

Newman, Christian D.; Alsuhaibani, Reem S.; Decker, Michael John; Peruma, Anthony; Kaushik, Dishant; Mkaouer, Mohamed Wiem; Hill, Emily

doi:10.1016/j.jss.2020.110740

Cited by 29 publications

(65 citation statements)

References 55 publications


“…In addition, the ensemble has been made fully available (see Section IV), and is intended for long term support by the research team as we expand the training set and include identifiers from different contexts (e.g., test code). 2) Confirmation of observations we made in prior work [25] that indicate 1) the importance of the position of a word in an identifier, 2) the importance of the context of an identifier when annotating using part-of-speech, and 3) the complementarity of three part-of-speech taggers. 3) An expanded set of manually-annotated identifiers, based on the original set constructed in [25], that can be used to train and create other tagging approaches or for other natural language problems.…”

Section: Introduction (supporting)

confidence: 68%

“…2) Confirmation of observations we made in prior work [25] that indicate 1) the importance of the position of a word in an identifier, 2) the importance of the context of an identifier when annotating using part-of-speech, and 3) the complementarity of three part-of-speech taggers. 3) An expanded set of manually-annotated identifiers, based on the original set constructed in [25], that can be used to train and create other tagging approaches or for other natural language problems. As with the implementation, this has been made fully available to the research community (see Section IV).…”

Section: Introduction (supporting)

confidence: 68%

“…Part-of-speech tagging is one of the most popular methods for measuring the natural language semantics of identifier names and has been used in numerous other research [19], [17], [20], [21], [22], [23], [14], [13], [24]. Unfortunately, part-of-speech taggers for identifiers are still inaccurate [25], [18], making it difficult to trust their output.…”

Section: Introduction (mentioning)

confidence: 99%

“…1) An implementation of the most accurate (to-date) part-of-speech tagger for source code identifiers, built using data that was curated via significant manual-annotation effort made by the authors in prior work [25]. This approach is trained to support more types of annotations (i.e., POS tags) specifically oriented for source code than any other approach currently available.…”

Section: Introduction (mentioning)

confidence: 99%

“…As with the implementation, this has been made fully available to the research community (see Section IV). 4) A thorough evaluation of the ensemble approach at both the identifier- and word-level, including a discussion of the features, which were empirically derived by the authors in prior work [25], that most positively influence the tagger's performance. 5) A discussion that provides a clear path for future work on part-of-speech tagger accuracy and effectiveness.…”

Section: Introduction (mentioning)

confidence: 99%


An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

Newman, Decker, AlSuhaibani et al. 2021

Preprint

Self Cite


This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine-learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to promote the future amelioration of these problems through further research. Our results show that the ensemble achieves 75% accuracy at the identifier level and 84-86% accuracy at the word level. This is an increase of +17% points at the identifier level from the closest independent part-of-speech tagger.

