Proceedings of the 17th International Conference on Mining Software Repositories (MSR 2020)
DOI: 10.1145/3379597.3387445
Embedding Java Classes with code2vec

Abstract: Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java m…
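The abstract is truncated before it describes how method-level vectors become class-level ones. A minimal sketch of one natural reading, assuming the class embedding is a mean pool over per-method code2vec vectors (384 dimensions is code2vec's default code-vector size; both the pooling and the dimension are assumptions, not statements from the paper):

```python
import numpy as np

def embed_class(method_vectors):
    """Mean-pool per-method code2vec vectors into one class-level vector."""
    return np.stack(method_vectors).mean(axis=0)

# Toy stand-ins: three methods of one class, each already embedded into
# a 384-dimensional vector (code2vec's default code-vector size).
rng = np.random.default_rng(0)
method_vectors = [rng.standard_normal(384) for _ in range(3)]
class_vector = embed_class(method_vectors)
print(class_vector.shape)  # (384,)
```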

Cited by 42 publications (13 citation statements) | References 14 publications
“…Therefore, semantic tasks should use upper-level layers to represent code to achieve the best performance. Interestingly, performance remains uniform across the middle layers (4–10). This may be related to the uniform attention to identifiers and values that we see in Figure 4.…”
Section: Semantic Representation of Code for Code Clone Detection Usi…
Mentioning; confidence: 55%
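For context on what "using layer k to represent code" means in practice, here is a hedged sketch of extracting per-layer hidden states from a BERT-style code model with the HuggingFace transformers API; the microsoft/codebert-base checkpoint and the mean pooling are illustrative assumptions, not the cited study's setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint only; the cited study's exact model may differ.
name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

code = "public int max(int a, int b) { return a > b ? a : b; }"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states has 13 entries: index 0 is the embedding layer,
# indices 1-12 are the transformer layers. Mean-pool the tokens of an
# upper layer to get one fixed-size vector per code fragment.
layer_repr = out.hidden_states[10].mean(dim=1)  # shape (1, 768)
```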
“…They compare the performance of these models from the perspective of probing classifiers. Other studies [5, 9, 35, 49] show that identifiers are important code entities and can be used in the modeling of Transformer-based models. This work presents the first study in software engineering that analyses the multi-headed attention mechanism of BERT, which has not been done previously [21].…”
Section: Related Work
Mentioning; confidence: 99%
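As a companion to the attention analysis this statement refers to, the sketch below shows the standard way to read per-head attention weights out of a BERT-style model; the checkpoint, layer index, and toy input are assumptions for illustration only:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style model exposes attention this way; the checkpoint is
# an illustrative choice, not the one used in the cited analysis.
name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("int total = a + b;", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shaped (batch, heads, seq, seq).
# Attention mass each token receives in layer 5, averaged over heads
# and summed over query positions:
received = out.attentions[5].mean(dim=1)[0].sum(dim=0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, received):
    print(f"{tok:>10} {score:.3f}")
```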
“…While this work focused on extracting a set of handcrafted features for better transparency, we study how feature enrichment affects the model's training behavior. Recent studies have shown that state-of-the-art models heavily rely on variables [13, 28], specific tokens [29], and even structures [30]. Chen et al. [31] focus on semantic representations of program variables and study how well models can learn similarity between variables that have similar meanings (e.g., minimum and minimal).…”
Section: Related Work
Mentioning; confidence: 99%
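To make the variable-similarity idea concrete, here is a small sketch of the usual measurement, cosine similarity between identifier embeddings; the vectors below are random stand-ins, not embeddings from Chen et al.'s models:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random stand-ins: with learned identifier embeddings, the pair
# "minimum"/"minimal" should score well above "minimum"/"buffer";
# with random vectors the printed numbers carry no such meaning.
rng = np.random.default_rng(42)
e_minimum, e_minimal, e_buffer = rng.standard_normal((3, 128))
print(cosine(e_minimum, e_minimal), cosine(e_minimum, e_buffer))
```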
“…Adversarial attacks. Related work stems from Compton et al. [20], who introduce randomization of variable names in the training dataset of a code2vec model as data augmentation. Their study shows that the model trained on the augmented dataset achieves slightly better accuracy than the model trained on the original dataset, which motivates a systematic investigation of overfitting.…”
Section: Background and Related Work
Mentioning; confidence: 99%
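The augmentation idea from Compton et al. [20] can be sketched as renaming variables to fresh random identifiers before training; the regex-based helper below is a simplified illustration (their work operates on parsed code, and the variable list here is supplied by hand rather than discovered automatically):

```python
import random
import re

def randomize_variable_names(java_src, names):
    """Replace each listed variable name with a fresh random identifier.

    A regex-level sketch of the augmentation idea; a real pipeline
    would rename occurrences via a proper parse of the code.
    """
    out = java_src
    for name in names:
        fresh = "v_" + "".join(random.choices("abcdefghij", k=6))
        # Word-boundary match so substrings of other tokens survive.
        out = re.sub(rf"\b{re.escape(name)}\b", fresh, out)
    return out

print(randomize_variable_names("int total = a + b; return total;", ["total"]))
```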