Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis 2018
DOI: 10.1145/3243127.3243132
A language-agnostic model for semantic source code labeling

Abstract: Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic…

Cited by 6 publications (8 citation statements)
References: 26 publications
“…The availability of more training data will then allow researchers to use more advanced text embeddings such as BERT [1] and XLNet [19] which are the current state-of-the-art in the NLP field. Last but not least, different lines of work can also be tried such as avoiding preprocessing the dataset and learning character embeddings, similar to what was performed in [3].…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Researchers in the articles have also suggested investigating further regarding the suitable metrics and loss functions employed in the evaluation of ML for SE-focused techniques, especially for multi-class classification problems [125].…”
Section: Future Research Directions
Citation type: mentioning
confidence: 99%
“…To support requirements 2 and 3 (Changes in Code Content and Required Skills), the data pipeline uses Gelman et al's system [8] to generate a set of tags for each code file. These are semantic tags learned from Stack Overflow, such as c++, multithreading, or machine-learning.…”
Section: Data Pipeline
Citation type: mentioning
confidence: 99%
“…By searching the open web we find that in September 2017 it was publicly stated that Theano is deprecated and new development will stop. The timing of this announcement comes shortly before the large spike in commit activity near the end of 2017, which corresponds with the last major release of Theano. This announcement also comes around the time we see a sharp decline in the bus factor and the number of new issues.…”
Section: Comparison To Ground Truth
Citation type: mentioning
confidence: 99%