2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) 2017
DOI: 10.1109/msr.2017.42

Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments

Abstract: To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then repor…

Citations: cited by 78 publications (34 citation statements)
References: 18 publications
“…Furthermore, we investigated whether part-of-speech patterns indicate question categories, following a similar approach as Chaparro et al. (2017) for bug reports. To get the part-of-speech tags, we used spaCy, a Python-based part-of-speech tagger that has been shown to work best for SO data compared to other NLP libraries (Omran and Treude 2017). Using spaCy, we created the part-of-speech tags for the title, the body, and the phrases of a post.…”
Section: Experimental Setup Using Machine Learning Algorithms (mentioning)
confidence: 99%
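The citing passage above describes tagging the title and body of a post with spaCy and looking for part-of-speech patterns. A minimal sketch of that idea follows; it is not the citing authors' pipeline. The model name `en_core_web_sm`, the helper names `tag_text` and `pos_ngrams`, the example post, and the choice of tag n-grams as the "pattern" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tag_text(nlp, text: str) -> List[Tuple[str, str]]:
    """Return (token, coarse POS tag) pairs for `text`.

    `nlp` is a loaded spaCy pipeline; `token.pos_` is spaCy's coarse tag.
    """
    return [(token.text, token.pos_) for token in nlp(text)]


def pos_ngrams(tags: Iterable[str], n: int = 2) -> Counter:
    """Count contiguous n-grams of POS tags (a hypothetical pattern feature)."""
    tags = list(tags)
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


if __name__ == "__main__":
    import spacy  # imported here so pos_ngrams can be used without spaCy

    # Assumed model; any spaCy pipeline with a tagger would do.
    nlp = spacy.load("en_core_web_sm")

    # Hypothetical post with the two fields named in the passage.
    post = {
        "title": "How do I parse JSON in Java?",
        "body": "I tried several libraries but the parser throws an exception.",
    }
    for field, text in post.items():
        tags = [pos for _, pos in tag_text(nlp, text)]
        print(field, pos_ngrams(tags, n=2).most_common(3))
```

Running the script requires spaCy and the assumed model to be installed; `pos_ngrams` itself depends only on the standard library, so it can be applied to tags produced by any tagger.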
“…In order to aid developers faced with documentation issues, we conducted an empirical study to understand the written content themes of the README file. The co-founder of GitHub, Tom Preston-Werner, even discussed the importance of the README file, coining Readme Driven Development (RDD) as an important subset of Document Driven Development. We learned some valuable lessons along the way: Lesson 1: Although a README file contains numerous variations, we built a taxonomy of 22 README content themes. Surprisingly, from over 30,000 content theme variations, we were able to build a taxonomy of 22 headline content themes, which are used by more than 1% of packages.…”
Section: Summary Of Results (mentioning)
confidence: 99%
“…Named Entity Recognition: Named Entity Recognition (NER) tags important words identified in a text (such as people, organizations, cities, etc.). We use the spaCy library [15] for its efficiency. We apply NER to scene description blocks and discard irrelevant categories such as quantities, ordinals, money, etc. Because many words may end up mislabelled (especially due to the ambiguous context of a sci-fi movie), we manually curate the resulting list of words.…”
Section: Text Processing (mentioning)
confidence: 99%
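The NER step described above (extract entities with spaCy, then discard categories such as quantities, ordinals, and money) might be sketched as follows. This is an illustration under stated assumptions, not the citing authors' code: the model name `en_core_web_sm`, the helper names, and the example sentence are hypothetical, and the discard set contains only the three categories the passage names, mapped to spaCy's English label names `QUANTITY`, `ORDINAL`, and `MONEY`.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Only the categories named in the passage; extend as needed (assumption).
DISCARDED_LABELS: FrozenSet[str] = frozenset({"QUANTITY", "ORDINAL", "MONEY"})


def extract_entities(nlp, text: str) -> List[Tuple[str, str]]:
    """Return (entity text, entity label) pairs found by a spaCy pipeline."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def filter_entities(
    entities: Iterable[Tuple[str, str]],
    discarded: FrozenSet[str] = DISCARDED_LABELS,
) -> List[Tuple[str, str]]:
    """Drop entities whose label is in `discarded`, preserving order."""
    return [(text, label) for text, label in entities if label not in discarded]


if __name__ == "__main__":
    import spacy  # imported here so filter_entities works without spaCy

    nlp = spacy.load("en_core_web_sm")  # assumed model name
    # Hypothetical scene description.
    scene = "The crew of the ship lands two miles from the first colony."
    print(filter_entities(extract_entities(nlp, scene)))
```

The filter only removes whole categories; it does not replace the manual curation the passage mentions for mislabelled words.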