2019
DOI: 10.1016/j.dib.2019.104712

Source code analysis dataset

Abstract: The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw …

Cited by 8 publications (7 citation statements)
“…The augmented C/C++ dataset, built by extracting 250k observations in C and 250k observations in C++ from a SQL database provided by 29. They collected source code files from GitHub repositories written in C, C++, Java and Python; extracted comments using Doxygen and condensed such pairs into the database.…”
Section: Methods (citation type: mentioning)
confidence: 99%
“…For the generation of JAVA, we based our experiments on the Concode corpus 30. For C++, we used the DECODER OpenCV use case corpus presented above and a larger but less clean corpus, from the “Code and comments dataset” 29, of substantial size, named “C&C” below. This dataset contains a total of 16,115,540 pairs of comment and code, mined from 106,304 GitHub projects coded in Python, Java, C and C++ 31.…”
Section: Methods (citation type: mentioning)
confidence: 99%
“…Context in prior work [25] Code Context in SoCCMiner [20] public void NoEvent(int ix, V val){ // TODO elastic? // TODO elastic?…”
Section: Table I: Code Context Extraction Comparison (citation type: mentioning)
confidence: 99%
“…To collect a set of naturally written comments by developers, we used the extracted code-comment pair by Gelman et al. [22]. This datapool of code-comment pairs holds extracted pairs from GitHub projects for five different programming languages.…”
Section: Dataset Crawling and Extraction (citation type: mentioning)
confidence: 99%
“…C# is only considered in the scope of this work. To pull the code-comment pair, the author of [22] chose the GitHub repository based on having a redistributable license and at least 10 stars. Table 1 represents the list of the C# projects from the datapool, selected for this research.…”
Section: Dataset Crawling and Extraction (citation type: mentioning)
confidence: 99%