Building Program Vector Representations for Deep Learning

Peng, Hao; Mou, Lili; Li, Ge; Li, Yuxuan; Zhang, Lu; Jin, Zhi

doi:10.1007/978-3-319-25159-2_49

Cited by 99 publications

(55 citation statements)

References 35 publications

“…CLCMiner is based on revision histories; it is limited to detect cross-language clones that have been changed in the past in the same project. For clones that are never changed, we can explore more language attributes that can identify clone relations (e.g., using deep learning to build vector representation of programs [21]) across languages. We also believe this limitation can be compensated by a single-language detector that can detect cross-project and same-language clones based on certain clone transitivity across projects and languages.…”

Section: Discussion and Future Work (mentioning)

confidence: 99%

CLCMiner: Detecting Cross-Language Clones without Intermediates

Cheng, Peng, Jiang, et al. 2017

IEICE Trans. Inf. & Syst.

Summary: The proliferation of diverse kinds of programming languages and platforms makes it a common need to have the same functionality implemented in different languages for different platforms, such as Java for Android applications and C# for Windows phone applications. Although versions of code written in different languages appear syntactically quite different from each other, they are intended to implement the same software and typically contain many code snippets that implement similar functionalities, which we call cross-language clones. When the version of code in one language evolves according to changing functionality requirements and/or bug fixes, its cross-language clones may also need to be changed to maintain consistent implementations for the same functionality. Thus, automated ways to locate and track cross-language clones within the evolving software are needed. In the literature, approaches for detecting cross-language clones are only for languages that share a common intermediate language (such as the .NET language family) because they are built on techniques for detecting single-language clones. To extend the capability of cross-language clone detection to more diverse kinds of languages, we propose a novel automated approach, CLCMiner, without the need of an intermediate language. It mines such clones from revision histories, based on our assumption that revisions to different versions of code implemented in different languages may naturally reflect how programmers change cross-language clones in practice, and that similarities among the revisions (referred to as clones in diffs or diff clones) may indicate actual similar code. We have implemented a prototype and applied it to ten open source projects with implementations in both Java and C#. The reported clones that occur in revision histories have high precision (89% on average) and recall (95% on average). Compared with token-based code clone detection tools that can treat code as plain texts, our tool can detect significantly more cross-language clones. All the evaluation results demonstrate the feasibility of revision-history based techniques for detecting cross-language clones without intermediates and point to promising future work.

“…For each subtree, with parent node $p$ and $n$ child nodes $\{c_i\}_{1 \le i \le n}$, define $l_i = (\#\text{ leaves of } c_i)/(\#\text{ leaves of } p)$. Similar to [8], we define a loss function to measure how well the learnt vectors are describing the subtrees. Let $T$ be the number of distinct AST types whose embeddings we are trying to learn.…”

Section: Methods (mentioning)

confidence: 99%

AST-Based Deep Learning for Detecting Malicious PowerShell

Rusak, Al-Dujaili, O’Reilly 2018

Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security

With the celebrated success of deep learning, some attempts to develop effective methods for detecting malicious PowerShell programs employ neural nets in a traditional natural language processing setup while others employ convolutional neural nets to detect obfuscated malicious commands at a character level. While these representations may express salient PowerShell properties, our hypothesis is that tools from static program analysis will be more effective. We propose a hybrid approach combining traditional program analysis (in the form of abstract syntax trees) and deep learning. This poster presents preliminary results of a fundamental step in our approach: learning embeddings for nodes of PowerShell ASTs. We classify malicious scripts by family type and explore embedded program vector representations.

CCS Concepts: Security and privacy → Malware and its mitigation; Computing methodologies → Neural networks

Keywords: powershell scripts; malware; deep learning; abstract syntax trees
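The leaf-proportion weights defined in the citation statement above, l_i = (# leaves of c_i)/(# leaves of p), can be illustrated with a small sketch. This is only an illustration under stated assumptions: the `Node` class, the function names, and the example tree are hypothetical and are not taken from either paper, and the loss function the statement mentions is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal AST node: a type label plus child nodes."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def count_leaves(node: Node) -> int:
    """Number of leaves under `node`; a childless node counts as one leaf."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def leaf_weights(parent: Node) -> List[float]:
    """Return l_i = (# leaves of c_i) / (# leaves of p) for each child c_i."""
    total = count_leaves(parent)
    return [count_leaves(child) / total for child in parent.children]


# Illustrative subtree: the first child has two leaves, the other two are leaves.
tree = Node("If", [
    Node("BinaryOp", [Node("Name"), Node("Constant")]),
    Node("Return"),
    Node("Pass"),
])
weights = leaf_weights(tree)
print(weights)  # [0.5, 0.25, 0.25]
```

By construction the weights of one subtree sum to 1, so children that cover more of the parent's leaves contribute proportionally more.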

“…To achieve this goal, we mine code fragments where violations are localized and identify common patterns, not only in fixed violations but also in unfixed violations. Before describing our approach of mining common code patterns, we formalize the definition of a code pattern, and provide justifications for the techniques selected in the approach (namely CNNs [18], [31], [32] and X-means clustering algorithm [19]).…”

Section: Mining Common Code Patterns (mentioning)

confidence: 99%

Mining Fix Patterns for FindBugs Violations

Liu, Kim, Bissyandé, et al. 2021

IEEE Trans. Software Eng.

Several static analysis tools, such as Splint or FindBugs, have been proposed to the software development community to help detect security vulnerabilities or bad programming practices. However, the adoption of these tools is hindered by their high false positive rates. If the false positive rate is too high, developers may get acclimated to violation reports from these tools, causing concrete and severe bugs to be overlooked. Fortunately, some violations are actually addressed and resolved by developers. We claim that those violations that are recurrently fixed are likely to be true positives, and an automated approach can learn to repair similar unseen violations. However, there is a lack of a systematic way to investigate the distributions on existing violations and fixed ones in the wild, that can provide insights into prioritizing violations for developers, and an effective way to mine code and fix patterns which can help developers easily understand the reasons of leading violations and how to fix them. In this paper, we first collect and track a large number of fixed and unfixed violations across revisions of software. The empirical analyses reveal that there are discrepancies in the distributions of violations that are detected and those that are fixed, in terms of occurrences, spread and categories, which can provide insights into prioritizing violations. To automatically identify patterns in violations and their fixes, we propose an approach that utilizes convolutional neural networks to learn features and clustering to regroup similar instances. We then evaluate the usefulness of the identified fix patterns by applying them to unfixed violations. The results show that developers will accept and merge a majority (69/116) of fixes generated from the inferred fix patterns. It is also noteworthy that the yielded patterns are applicable to four real bugs in the Defects4J major benchmark for software testing and automated repair.
