Proceedings of the 17th International Conference on Mining Software Repositories (MSR 2020)
DOI: 10.1145/3379597.3387445
Embedding Java Classes with code2vec

Abstract: Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java m…
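The abstract is truncated before it describes how method-level vectors become class-level ones. A minimal sketch of one natural reading, assuming the class embedding is a mean pool over per-method code2vec vectors (384 dimensions is code2vec's default code-vector size; both the pooling and the dimension are assumptions, not statements from the paper):

```python
import numpy as np

def embed_class(method_vectors):
    """Mean-pool per-method code2vec vectors into one class-level vector."""
    return np.stack(method_vectors).mean(axis=0)

# Toy stand-ins: three methods of one class, each already embedded into
# a 384-dimensional vector (code2vec's default code-vector size).
rng = np.random.default_rng(0)
method_vectors = [rng.standard_normal(384) for _ in range(3)]
class_vector = embed_class(method_vectors)
print(class_vector.shape)  # (384,)
```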

Cited by 42 publications (13 citation statements) | References 14 publications
“…Therefore, semantic tasks should use upper-level layers to represent code to achieve the best performance. Interestingly, performance remains uniform across the middle layers (4–10). This may be related to the uniform attention to identifiers and values that we see in Figure 4.…”
Section: Semantic Representation of Code for Code Clone Detection Usi…
Mentioning; confidence: 55%
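For context on what "using layer k to represent code" means in practice, here is a hedged sketch of extracting per-layer hidden states from a BERT-style code model with the HuggingFace transformers API; the microsoft/codebert-base checkpoint and the mean pooling are illustrative assumptions, not the cited study's setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint only; the cited study's exact model may differ.
name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

code = "public int max(int a, int b) { return a > b ? a : b; }"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states has 13 entries: index 0 is the embedding layer,
# indices 1-12 are the transformer layers. Mean-pool the tokens of an
# upper layer to get one fixed-size vector per code fragment.
layer_repr = out.hidden_states[10].mean(dim=1)  # shape (1, 768)
```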
“…They compare the performance of these models from the perspective of probing classifiers. Other studies [5, 9, 35, 49] show that identifiers are important code entities and can be used in the modeling of Transformer-based models. This work presents the first study in software engineering that analyses the multi-headed attention mechanism of BERT, which has not been done previously [21].…”
Section: Related Work
Mentioning; confidence: 99%
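As a companion to the attention analysis this statement refers to, the sketch below shows the standard way to read per-head attention weights out of a BERT-style model; the checkpoint, layer index, and toy input are assumptions for illustration only:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style model exposes attention this way; the checkpoint is
# an illustrative choice, not the one used in the cited analysis.
name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("int total = a + b;", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shaped (batch, heads, seq, seq).
# Attention mass each token receives in layer 5, averaged over heads
# and summed over query positions:
received = out.attentions[5].mean(dim=1)[0].sum(dim=0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, received):
    print(f"{tok:>10} {score:.3f}")
```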
“…While this work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior. Recent studies have shown that state-of-the-art models heavily rely on variables [13, 28], specific tokens [29], and even structures [30]. Chen et al. [31] focus on semantic representations of program variables and study how well models can learn similarity between variables that have similar meanings (e.g., minimum and minimal).…”
Section: Related Work
Mentioning; confidence: 99%
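To make the variable-similarity idea concrete, here is a small sketch of the usual measurement, cosine similarity between identifier embeddings; the vectors below are random stand-ins, not embeddings from Chen et al.'s models:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random stand-ins: with learned identifier embeddings, the pair
# "minimum"/"minimal" should score well above "minimum"/"buffer";
# with random vectors the printed numbers carry no such meaning.
rng = np.random.default_rng(42)
e_minimum, e_minimal, e_buffer = rng.standard_normal((3, 128))
print(cosine(e_minimum, e_minimal), cosine(e_minimum, e_buffer))
```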
“…Adversarial attacks. Related work stems from Compton et al. [20], who introduce randomization of variable names in the training dataset of a code2vec model as data augmentation. Their study shows that the model trained on the augmented dataset achieves slightly better accuracy than the model trained on the original dataset, which motivates a systematic investigation of overfitting.…”
Section: Background and Related Work
Mentioning; confidence: 99%
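The augmentation idea from Compton et al. [20] can be sketched as renaming variables to fresh random identifiers before training; the regex-based helper below is a simplified illustration (their work operates on parsed code, and the variable list here is supplied by hand rather than discovered automatically):

```python
import random
import re

def randomize_variable_names(java_src, names):
    """Replace each listed variable name with a fresh random identifier.

    A regex-level sketch of the augmentation idea; a real pipeline
    would rename occurrences via a proper parse of the code.
    """
    out = java_src
    for name in names:
        fresh = "v_" + "".join(random.choices("abcdefghij", k=6))
        # Word-boundary match so substrings of other tokens survive.
        out = re.sub(rf"\b{re.escape(name)}\b", fresh, out)
    return out

print(randomize_variable_names("int total = a + b; return total;", ["total"]))
```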