2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
DOI: 10.1109/msr.2019.00015
Semantic Source Code Models Using Identifier Embeddings

Abstract: The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic repres…

Cited by 18 publications (14 citation statements)
References 37 publications
“…Similarly, (Pradel and Sen 2018) proposed DeepBugs to identify name-based bug detection using semantic representations of code. Likewise, (Efstathiou and Spinellis 2019) proposed distributed code representations for six different programming languages: Java, Python, PHP, C, C++, and C#. They used fastText for learning semantic representations and studied dissimilarities between code and natural language, proposing various applications and limitations.…”
Section: Related Work
confidence: 99%
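As context for this statement, the general technique it describes (learning subword-aware embeddings over source code identifiers with fastText) can be sketched in a few lines. The sketch below is a minimal illustration, assuming gensim's FastText implementation and a toy identifier list; it is not the pipeline used by Efstathiou and Spinellis (2019).

# Minimal sketch: subword-aware embeddings over source code identifiers.
# Assumes gensim is installed; the identifiers below are a toy stand-in for
# tokens mined from real repositories.
import re
from gensim.models import FastText

def split_identifier(name):
    """Split camelCase and snake_case identifiers into lowercase subtokens."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
    return [p.lower() for p in parts if p]

# Each "sentence" here is the subtoken sequence of one identifier; in practice
# it would be the token stream of a whole function or file.
corpus = [
    split_identifier(tok)
    for tok in ["getUserName", "user_name", "fetchUserProfile",
                "parseJsonResponse", "json_decode", "readFileContents"]
]

# fastText learns character n-gram vectors alongside word vectors.
model = FastText(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50)

print(model.wv.most_similar("user", topn=3))

Because fastText represents words through character n-grams, the resulting model can also embed identifiers that never occurred in the training corpus, which is one reason it suits the open vocabulary of source code.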
“…For instance, we started with the metrics suggested by Pimentel and colleagues (2019) and others (Biswas et al, 2019) such as the length of notebook titles, the placement of imports, the presence of dependency requirements files, and the use of relative paths to access the data. Similarly, for the DL-based approach, we adopted a highly successful approach that has been developed recently by the DL community: automatically learn suitable representations, i.e., embeddings (Efstathiou & Spinellis, 2019). Such representations are known to improve the performance of downstream learning tasks or applications such as contextual search and analogical reasoning in the case of natural language semantics.…”
Section: Introduction
confidence: 99%
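The downstream uses named in this statement, contextual search and analogical reasoning, reduce to nearest-neighbour and vector-arithmetic queries over the learned embeddings. The following is a self-contained toy sketch, again assuming gensim; the subtoken corpus is invented for illustration and is not data from any of the cited works.

# Illustration only: how vector queries over identifier embeddings support
# contextual search and analogy-style reasoning. Real pipelines train on
# millions of identifiers mined from repositories.
from gensim.models import FastText

corpus = [["read", "file", "contents"], ["write", "file", "buffer"],
          ["parse", "json", "response"], ["decode", "json", "payload"],
          ["get", "user", "name"], ["fetch", "user", "profile"]]
model = FastText(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=100)

# Contextual search: nearest neighbours of a query subtoken.
print(model.wv.most_similar("json", topn=3))

# Analogy query, word2vec-style: read is to file as parse is to ...?
print(model.wv.most_similar(positive=["parse", "file"],
                            negative=["read"], topn=3))

With a corpus this small the outputs are noisy; the point is only the shape of the queries that embedding-based tooling builds on.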
“…There are many studies on the representation of source code, including recent studies proposing distributed representations for identifiers [17], APIs [46,47], and software libraries [56]. A comprehensive survey of learning the representation of source code has been done by Allamanis et al [1].…”
Section: Related Work
confidence: 99%
“…The naturalness hypothesis of software approaches the subject in a similar way and asserts that although programming languages, in theory, are complex, flexible and powerful, the code fragments that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks [26]. For example, following the word embeddings concept in NLP, the authors of [27] generated a set of general-purpose models pre-trained over large amounts of code. Although the authors claimed that their models could be used to assist a number of information retrieval tasks, including identifying semantic errors, they did not provide any experimental results for these tasks.…”
Section: Related Work
confidence: 99%