Predicting the Programming Language: Extracting Knowledge from Stack Overflow Posts

Baquero, Juan F.; Camargo, Jorge E.; Restrepo-Calle, Felipe; Aponte, Jairo; González, Fabio A.

doi:10.1007/978-3-319-66562-7_15

Cited by 7 publications

(9 citation statements)

References 12 publications


“…Overflow questions to predict their programming language. This classifier achieves an accuracy of 81.1%, a precision of 0.83 and a recall of 0.81 which is much higher than the previous best model (Baquero et al [13]).…”

Section: ) a Classifier That Uses Only Textual Information In Stack (mentioning)

confidence: 68%


SCC++: Predicting the programming language of questions and snippets of Stack Overflow

Alrashedy

Dharmaretnam

Germán

et al. 2020

Journal of Systems and Software


“…Baquero et al [13] proposed a classifier to predict the programming language of a Stack Overflow question. They extracted a set of 18000 questions from Stack Overflow that contained text and code snippets, 1000 questions for each of 18 programming languages.…”

Section: Related Work (mentioning)

confidence: 99%

SCC++: Predicting the programming language of questions and snippets of Stack Overflow

Alrashedy

Dharmaretnam

Germán

et al. 2020

Journal of Systems and Software


“…Dam and Zaytsev [19] utilized statistical language models such as n-grams and skip grams in natural language processing (NLP) for programming language identification. Baquero et al [4] proposed a model to predict the programming language from both comment text data and code snippets of Stack Overflow questions. They used Word2Vec for text feature extraction and n-gram for source code feature extraction.…”

Section: Related Work (mentioning)

confidence: 99%

“…Therefore, some automatic source code classification methods have been developed based on text classification [1][2][3][4]. In these studies, the source codes are considered as text, and classification methods are applied with the help of natural language processing (NLP) techniques.…”

(The citing article is part of the topical collection "Deep learning approaches for data analysis: A practical perspective", guest edited by D. Jude Hemanth, Lipo Wang and Anastasia Angelopoulou.)

Section: Introduction (mentioning)

confidence: 99%

Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning

et al. 2020


Source code classification (SCC) is a task to assign codes into different categories according to a criterion such as their functionalities, programming languages or vulnerabilities. Many source code archives are organized according to the programming languages, and thereby, the desired code fragments can be easily accessed by searching within the archive. However, manually organizing source code archives by field experts is labor intensive and impractical because of the fast-growing available source codes. Therefore, this study proposes new convolutional neural network (CNN) architectures to build source code classifiers that automatically identify programming languages from source codes. This is the first study in which the performances of deep learning algorithms on programming language identification are compared on both image and text files. In this study, the experiments are performed on three source code datasets to identify eight programming languages, including C, C++, C#, Go, Python, Ruby, Rust, and Java. The comparative results indicate that although text-based SCC and image-based SCC approaches achieve very high (> 93.5%) and similar accuracies, text-based classification has significantly better performance in terms of execution time.


“…In this paper, we are interested in a tool that can classify a code snippet which is a small block reusable code with at least two lines of code, a much more challenging task. The only previous work that studies classification of the programming languages from a code snippet or a few lines of source code is the work of Baquero et al [11]. However, they achieve low accuracy showing that identifying programming languages from a small source code or a code snippet is much harder than larger pieces.…”

Section: Introduction (mentioning)

confidence: 99%

[Engineering Paper] SCC: Automatic Classification of Code Snippets

Alreshedy

Dharmaretnam

Germán

et al. 2018

2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM)


Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts. It is shown to achieve an accuracy of 75%, which is higher than that of Programming Languages Identification (PLI, a proprietary online classifier of snippets), whose accuracy is only 55.5%. The average scores for precision, recall and F1 with the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, it can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0.
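
As a rough illustration of the approach the abstract above describes, the following is a minimal sketch of a Multinomial Naive Bayes classifier over code-snippet tokens. It is not the SCC authors' implementation: the tokenizer, the use of single tokens as features, the Laplace smoothing constant, and the four toy training snippets are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers, integers, and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(snippet):
    """Split a code snippet into a flat list of tokens."""
    return TOKEN_RE.findall(snippet)


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # snippets seen per language
        self.token_counts = defaultdict(Counter)  # token frequencies per language
        self.vocab = set()

    def fit(self, snippets, labels):
        for snippet, label in zip(snippets, labels):
            tokens = tokenize(snippet)
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, snippet):
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for illustration, not taken from Stack Overflow).
train = [
    ("def add(a, b):\n    return a + b", "Python"),
    ("for x in range(10):\n    print(x)", "Python"),
    ('int main() {\n    printf("hi");\n    return 0;\n}', "C"),
    ("#include <stdio.h>\nint x = 0;", "C"),
]
model = MultinomialNB().fit([s for s, _ in train], [lab for _, lab in train])
print(model.predict("def square(n):\n    return n * n"))
```

With so little training data the sketch only separates snippets that share distinctive tokens with the examples; the accuracy figures quoted in the abstract come from the authors' own experiments on Stack Overflow posts, not from anything like this toy setup.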


