Abstract: A fundamental problem in finding applications that are highly relevant to development tasks is the mismatch between the high-level intent reflected in the descriptions of these tasks and the low-level implementation details of applications. To reduce this mismatch, we created an approach called Exemplar (EXEcutable exaMPLes ARchive) for finding highly relevant software projects in large archives of applications. After a programmer enters a natural-language query that contains high-level concepts (e.g., MIME, data …
“…We assess the efficiency of the engines through the Mean Reciprocal Rank (MRR), a statistical metric used to evaluate a process that produces a list of possible responses to a query [18]. The reciprocal rank of a query is the multiplicative inverse of the rank of the first relevant answer.…”
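The MRR computation described above can be sketched directly: each query's reciprocal rank is the multiplicative inverse of the rank of the first relevant answer, and MRR averages this over all queries. This is a generic illustration of the metric, not the evaluated engines' code; the example queries and relevance sets are made up.

```python
def reciprocal_rank(results, relevant):
    """Multiplicative inverse of the rank of the first relevant answer (0 if none)."""
    for rank, item in enumerate(results, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """queries: list of (ranked_results, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Toy example: first relevant answer at ranks 1, 3, and 2
# -> MRR = (1 + 1/3 + 1/2) / 3 = 11/18
queries = [
    (["a", "b"], {"a"}),        # RR = 1
    (["x", "y", "z"], {"z"}),   # RR = 1/3
    (["p", "q"], {"q"}),        # RR = 1/2
]
print(mean_reciprocal_rank(queries))  # ≈ 0.611
```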
Section: RQ3: Comparison Against General Search Engines
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
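The query-augmentation idea can be illustrated with a minimal sketch: retrieve a similar Q&A pair, extract structural code entities from the accepted answer, and append them to the user's free-form query. The toy Q&A index, the word-overlap retrieval, and the CamelCase heuristic are all simplifying assumptions for illustration; CoCaBu's actual matching and entity extraction are more sophisticated.

```python
import re

# Hypothetical miniature Q&A index: question text -> accepted-answer code snippet.
QA_INDEX = {
    "how to read a file line by line in java":
        "BufferedReader br = new BufferedReader(new FileReader(path));",
}

def extract_code_entities(snippet):
    # Grab CamelCase identifiers as stand-ins for structural code entities.
    return sorted(set(re.findall(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b", snippet)))

def augment_query(query):
    # Pick the stored question sharing the most words with the query (toy retrieval).
    q_words = set(query.lower().split())
    best = max(QA_INDEX, key=lambda q: len(q_words & set(q.split())))
    return query + " " + " ".join(extract_code_entities(QA_INDEX[best]))

print(augment_query("read a file line by line"))
# -> "read a file line by line BufferedReader FileReader"
```

The augmented query now contains identifiers that actually appear in source code, so a term-matching search over a code repository has something concrete to match against.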
“…The terms added can come from a variety of thesauruses [66], rule systems mapping keywords to related terms [28], related Java documentation [41], or from the code the developer is currently writing [12]. For example, Lemos et al. [66] found that, when queries were automatically expanded with synonyms from the WordNet [135] thesaurus, the recall of CodeGenie [65] increased by 30% (i.e., query expansion allowed CodeGenie to return more on-topic results that otherwise would not have been returned).…”
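Synonym-based query expansion of this kind can be sketched in a few lines. The hand-made synonym table below is a stand-in for a real thesaurus lookup such as WordNet; the entries are illustrative assumptions, not WordNet's actual output.

```python
# Toy synonym table standing in for a WordNet lookup (entries are illustrative;
# a real system would query the thesaurus at runtime).
SYNONYMS = {
    "delete": ["remove", "erase"],
    "file": ["document"],
}

def expand_query(query):
    """Append known synonyms of each query term, keeping the original terms first."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand_query("delete file"))
# -> ['delete', 'file', 'remove', 'erase', 'document']
```

Expanding the query increases recall, because results phrased with any synonym now match, at some risk to precision when a synonym carries an unintended sense.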
Section: Improving Ranking Algorithms With Automatic Query Modification
“…Since programs contain API calls with precisely defined semantics, these API calls can serve as semantic anchors for computing the degree of similarity between requirements and artifacts, by matching the semantics that applications express through these API calls. Programmers routinely use third-party API calls (e.g., the Java Development Kit (JDK)) to implement various requirements [10,21,30,31,47]. API calls from well-known and widely used libraries have precisely defined semantics, unlike the names of program variables, types, and the words that programmers use in comments.…”
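One simple way to use API calls as semantic anchors is to compare the sets of API calls two artifacts invoke, e.g. with Jaccard similarity. This is a minimal sketch of the idea, not the cited approach's actual ranking scheme, and the example call lists are made up.

```python
def api_similarity(calls_a, calls_b):
    """Jaccard similarity over the sets of API calls two artifacts invoke.
    Shared well-defined API calls act as semantic anchors between them."""
    a, b = set(calls_a), set(calls_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical applications described by the JDK APIs they call.
app1 = ["javax.mail.Transport.send", "java.io.FileReader.read"]
app2 = ["javax.mail.Transport.send", "java.util.zip.ZipFile.entries"]
print(api_similarity(app1, app2))  # 1 shared call of 3 distinct -> ≈ 0.333
```

Because fully qualified API names are stable and unambiguous, this comparison avoids the vocabulary drift that plagues matching on variable names or comment words.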
Abstract: Software traceability is the ability to describe and follow the life of a requirement in both a forward and backward direction by defining relationships to related development artifacts. A plethora of different traceability recovery approaches use information retrieval techniques, which depend on the quality of the textual information in requirements and software artifacts. Not only is it important that stakeholders use meaningful names in these artifacts, but it is also crucial that the same names are used to specify the same concepts in different artifacts. Unfortunately, the latter is difficult to enforce, and as a result software traceability approaches are not as efficient and effective as they could be, to the point where it is questionable whether the anticipated economic and quality benefits were indeed achieved. We propose a novel and automatic approach for expanding corpora with relevant documentation that is obtained using external function call documentation and sets of relevant words, which we implemented in TraceLab. We experimented with three Java applications, and we show that using our approach the precision of recovering traceability links increased by up to 31% in the best case and by approximately 9% on average.