Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments

Omran, Fouad Nasser A Al; Treude, Christoph

doi:10.1109/msr.2017.42

Cited by 78 publications

(34 citation statements)

References 18 publications


“…Furthermore, we investigated whether part-of-speech patterns indicate question categories, following a similar approach as Chaparro et al. (2017) for bug reports. To get the part-of-speech tags, we used spaCy, a Python-based part-of-speech tagger that has been shown to work best for SO data compared to other NLP libraries (Omran and Treude 2017). Using spaCy, we created the part-of-speech tags for the title, the body, and the phrases of a post.…”

Section: Experimental Setup Using Machine Learning Algorithms (mentioning)

confidence: 99%

What kind of questions do developers ask on Stack Overflow? A comparison of automated approaches to classify posts into question categories

et al. 2019


On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support developers in question searching and browsing. However, these tags mainly refer to technological aspects instead of the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim at automating the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then, we manually classified 1,000 SO questions according to our new taxonomy. Additionally to the question category, we marked the phrases that indicate a question category for each of the posts. We then use this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the phrases to find patterns. Based on regular expressions, we implemented a classifier, for each of the categories, that determines whether a post belongs to a category. These regular expressions are derived by analyzing patterns in the phrases. In the second approach, we use the curated data set to train classification models of supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best configuration that uses machine learning algorithms on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68. 
Additionally, we applied the regex approach on all questions of SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API USAGE, CONCEPTUAL, and DISCREPANCY are the most frequently assigned question categories and that they also occur together frequently. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
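The regular-expression approach described in this abstract (one binary classifier per category, each deciding whether a post belongs to that category) can be illustrated with a minimal sketch. The three category names come from the abstract; the patterns themselves, the `CATEGORY_PATTERNS` table, and the `classify` function are hypothetical and are not the expressions used in the paper.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual regular
# expressions are not reproduced here.
CATEGORY_PATTERNS = {
    "API USAGE": re.compile(r"\bhow (?:do|can|to)\b", re.IGNORECASE),
    "CONCEPTUAL": re.compile(
        r"\b(?:what is|difference between|why (?:is|does))\b", re.IGNORECASE
    ),
    "DISCREPANCY": re.compile(
        r"\b(?:doesn't work|does not work|not working|unexpected)\b", re.IGNORECASE
    ),
}


def classify(post_text):
    """Return the sorted names of all categories whose pattern matches.

    Each category is tested independently, so one post can receive
    several categories (or none).
    """
    return sorted(
        name
        for name, pattern in CATEGORY_PATTERNS.items()
        if pattern.search(post_text)
    )


if __name__ == "__main__":
    print(classify("How do I parse JSON in Android? My code is not working."))
    print(classify("What is the difference between a Service and a Thread?"))
```

Testing each category independently is what allows categories to co-occur on a single post, which is the property the Android co-occurrence analysis relies on.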


“…In order to aid developers faced with documentation issues, we conducted an empirical study to understand the written content themes of the README file. Tom Preston-Werner, the co-founder of GitHub, even discussed the importance of the README file, coining Readme Driven Development (RDD) as an important subset of Document Driven Development. We learned some valuable lessons along the way. Lesson 1: Although a README file contains numerous variations, we built a taxonomy of 22 README content themes. Surprisingly, from over 30,000 content theme variations, we were able to build a taxonomy of 22 headline content themes, which are used by more than 1% of packages.…”

Section: Summary Of Results (mentioning)

confidence: 99%

An Empirical Study of README contents for JavaScript Packages

Ikeda, Ihara, Kula, et al. 2019

IEICE Trans. Inf. & Syst.


Contemporary software projects often utilize a README.md to share crucial information such as installation and usage examples related to their software. Furthermore, these files serve as an important source of updated and useful documentation for developers and prospective users of the software. Nonetheless, both novice and seasoned developers are sometimes unsure of what is required for a good README file. To understand the contents of README, we investigate the contents of 43,900 JavaScript packages. Results show that these packages contain common content themes (i.e., 'usage', 'install' and 'license'). Furthermore, we find that application-specific packages more frequently included content themes such as 'options', while library-based packages more frequently included other specific content themes (i.e., 'install' and 'license').


“…Named Entity Recognition: Named Entity Recognition (NER) tags important words identified in a text (such as people, organizations, cities, etc.). We use the spaCy library [15] for its efficiency. We apply NER to scene description blocks and discard irrelevant categories such as quantities, ordinals, money, etc. Because many words may end up mislabelled (especially due to the ambiguous context of a sci-fi movie), we manually curate the resulting list of words.…”

Section: Text Processing (mentioning)

confidence: 99%

Multilayer Network Model of Movie Script

Mourchid, Renoust, Cherifi 2018

Studies in Computational Intelligence


Network models have been increasingly used in the past years to support summarization and analysis of narratives, such as famous TV series, books and news. Inspired by social network analysis, most of these models focus on the characters at play. Such a network model captures all character interactions well, giving a broad picture of the narration's content. A few works went further by introducing additional semantic elements, always captured in a single-layer network. In contrast, we introduce in this work a multilayer network model to capture more elements of the narration of a movie from its script: people, locations, and other semantic elements. This model enables new measures and insights on movies. We demonstrate this model on two very popular movies.
