Abstract: Knowledge management plays a central role in many software development organizations. While much of the important technical knowledge can be captured in documentation, there often exists a gap between the information needs of software developers and the documentation structure. To help developers navigate documentation, we developed a technique for automatically extracting tasks from software documentation by conceptualizing tasks as specific programming actions that have been described in the documen…
“…In many of these cases, the background of the users seems to determine whether they understand a sentence or not. We found a similar situation in our previous work [36] when we asked developers to rate the meaningfulness of task descriptions that we had automatically extracted from their software documentation. In those cases, we argue that displaying such sentences does little harm if some users do not understand them while other users find them useful.…”
Section: Inter-rater Agreement (supporting)
confidence: 64%
“…We had developed a set of techniques for preprocessing software documentation in previous work [36,37]. We summarize them here for completeness.…”
Software developers need access to different kinds of information, which is often dispersed among different documentation sources, such as API documentation or Stack Overflow. We present an approach to automatically augment API documentation with "insight sentences" from Stack Overflow: sentences that are related to a particular API type and that provide insight not contained in the API documentation of that type. Based on a development set of 1,574 sentences, we compare the performance of two state-of-the-art summarization techniques as well as a pattern-based approach for insight sentence extraction. We then present SISE, a novel machine learning-based approach that uses as features the sentences themselves, their formatting, their question, their answer, and their authors, as well as part-of-speech tags and the similarity of a sentence to the corresponding API documentation. With SISE, we were able to achieve a precision of 0.64 and a coverage of 0.7 on the development set. In a comparative study with eight software developers, we found that SISE resulted in the highest number of sentences that were considered to add useful information not found in the API documentation. These results indicate that taking into account the metadata available on Stack Overflow as well as part-of-speech tags can significantly improve unsupervised extraction approaches when applied to Stack Overflow data.
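The evaluation numbers quoted above (precision of 0.64, coverage of 0.7) follow the standard definitions for this kind of extraction task. A minimal sketch, using invented sentence IDs and API type names rather than the paper's data:

```python
# Sketch of how precision and coverage could be computed for an
# insight-sentence classifier. All data here is hypothetical; SISE's actual
# features and pipeline are described in the paper, not reproduced here.

def precision(predicted: set, relevant: set) -> float:
    """Fraction of predicted insight sentences that are truly insightful."""
    if not predicted:
        return 0.0
    return len(predicted & relevant) / len(predicted)

def coverage(covered_types: set, all_types: set) -> float:
    """Fraction of API types for which at least one sentence was extracted."""
    if not all_types:
        return 0.0
    return len(covered_types & all_types) / len(all_types)

# Hypothetical example: sentence IDs predicted as insightful vs. gold labels.
predicted = {1, 2, 3, 5, 8}
relevant = {2, 3, 5, 7, 9, 11}
print(precision(predicted, relevant))  # 3 of 5 predictions correct -> 0.6

# Hypothetical example: API types covered by at least one extracted sentence.
covered = {"List", "Map", "Set"}
all_api_types = {"List", "Map", "Set", "Queue"}
print(coverage(covered, all_api_types))  # 3 of 4 types covered -> 0.75
```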
“…One threat to the validity of our results and an opportunity for future work lies in the fact that we used all four NLP libraries with their default settings 9 and without any specialized models. Also, the results are only reflecting the performance and accuracy of the current library versions which might change as the libraries are evolving.…”
Section: Threats To Validity (mentioning)
confidence: 99%
“…While it is common for researchers to rely on publicly available NLP libraries, some researchers develop their own tooling for specific tasks. For example, Allamanis et al [7] developed a customized system called Haggis for mining code idioms and in our own previous work, we added customizations to the Stanford NLP library to improve the accuracy of parsing natural language text authored by software developers [8], [9]. In this work, we aim to identify how the choice of using a particular publicly available NLP library could impact the results of any research that makes use of an NLP library.…”
Abstract: To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then report on a series of experiments in which we applied four state-of-the-art NLP libraries to publicly available software artifacts from three different sources. Our results show low agreement between different libraries (only between 60% and 71% of tokens were assigned the same part-of-speech tag by all four libraries) as well as differences in accuracy depending on source: For example, spaCy achieved the best accuracy on Stack Overflow data with nearly 90% of tokens tagged correctly, while it was clearly outperformed by Google's SyntaxNet when parsing GitHub ReadMe files. Our work implies that researchers should make an informed decision about the particular NLP library they choose and that customizations to libraries might be necessary to achieve good results when analyzing software artifacts written in natural language.
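The token-level agreement figure above (60% to 71% of tokens receiving the same part-of-speech tag from all four libraries) can be computed as in the following sketch; the tag sequences are invented for illustration, not taken from the paper's corpus:

```python
# Sketch: token-level agreement between several POS taggers.
# The tags below are hypothetical; the paper compared four real NLP libraries.

def full_agreement(taggings: list[list[str]]) -> float:
    """Fraction of token positions where every tagger assigned the same tag.

    `taggings` holds one tag sequence per library, all aligned to the same
    tokenization of the same text.
    """
    n_tokens = len(taggings[0])
    assert all(len(t) == n_tokens for t in taggings), "taggings must align"
    agree = sum(1 for tags in zip(*taggings) if len(set(tags)) == 1)
    return agree / n_tokens

# Hypothetical tags for the tokens of "parse the README file" from 4 taggers.
lib_a = ["VB", "DT", "NN", "NN"]
lib_b = ["VB", "DT", "NNP", "NN"]
lib_c = ["VB", "DT", "NN", "NN"]
lib_d = ["VB", "DT", "NNP", "NN"]
print(full_agreement([lib_a, lib_b, lib_c, lib_d]))  # 3 of 4 positions -> 0.75
```

Disagreements of exactly this kind, where a code term such as "README" is tagged as a common noun by one library and a proper noun by another, are what drive the overall agreement down on software artifacts.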
“…Several tools have been developed that automatically process natural language documents produced by software developers, for example by inferring specification from documentation [29], linking information from bug tracking systems and mailing lists to source code methods [14], summarizing bug reports [19], or extracting tasks from documentation [25]. Many of these tools rely on natural language processing tools such as the Stanford natural language processing toolkit [13] to split sentences, detect words in a sentence, assign parts of speech to words (such as adjective, verb, or noun), and to detect grammatical dependencies between different parts of a sentence (such as subject or direct object).…”
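The pipeline stages the quoted passage lists (sentence splitting, token detection, part-of-speech tagging, dependency parsing) can be sketched for the first two stages with naive rules; real toolkits such as the Stanford NLP toolkit use trained statistical models and handle far more edge cases:

```python
import re

# Naive sketch of the first two NLP pipeline stages: sentence splitting and
# tokenization. Real libraries (Stanford CoreNLP, spaCy, etc.) use trained
# models; these regex rules are only illustrative.

def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence: str) -> list[str]:
    """Separate word characters from punctuation. Note that this splits
    code terms like 'parse()' apart, which is exactly the kind of weakness
    on developer text that motivates customizing NLP tooling."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Call parse() first. Then inspect the result."
sents = split_sentences(text)
print(sents)               # ['Call parse() first.', 'Then inspect the result.']
print(tokenize(sents[0]))  # ['Call', 'parse', '(', ')', 'first', '.']
```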
Many tools that automatically analyze, summarize, or transform software artifacts rely on natural language processing tooling for the interpretation of natural language text produced by software developers, such as documentation, code comments, commit messages, or bug reports. Processing natural language text produced by software developers is challenging because of unique characteristics not found in other texts, such as the presence of code terms and the systematic use of incomplete sentences. In addition, texts produced by Portuguese-speaking developers mix languages since many keywords and programming concepts are referred to by their English name. In this paper, we provide empirical insights into the challenges of analyzing software artifacts written in Portuguese. We analyzed 100 question titles from the Portuguese version of Stack Overflow with two Portuguese language tools and identified multiple problems which resulted in very few sentences being tagged completely correctly. Based on these results, we propose heuristics to improve the analysis of natural language text produced by software developers in Portuguese.
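The language mixing described above, where Portuguese sentences embed English keywords and programming concepts, can be illustrated with a tiny keyword-list heuristic. This is not the heuristic the paper proposes; both the term list and the function are hypothetical:

```python
# Hypothetical heuristic sketch: flag English programming terms inside an
# otherwise Portuguese sentence. The paper proposes its own heuristics; this
# keyword-list approach only illustrates the mixed-language problem.

ENGLISH_CODE_TERMS = {"string", "array", "loop", "thread", "null", "print"}

def code_terms(sentence: str) -> list[str]:
    """Return tokens that look like English programming vocabulary."""
    tokens = sentence.lower().replace("?", " ").replace(".", " ").split()
    return [t for t in tokens if t in ENGLISH_CODE_TERMS]

# Portuguese question title mixing in English code terms
# ("How do I convert an array to a string?").
title = "Como converter um array para string?"
print(code_terms(title))  # ['array', 'string']
```

A tagger trained only on standard Portuguese has no category for such embedded English tokens, which is one reason very few of the analyzed question titles were tagged completely correctly.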