DéjàVu: a map of code duplicates on GitHub

Lopes, Cristina Videira; Maj, Petr; Martins, Pedro; Saini, Vaibhav; Yang, Di; Žitný, Jakub; Sajnani, Hitesh; Vítek, Jan

doi:10.1145/3133908

Cited by 147 publications

(95 citation statements)

References 25 publications

“…It has been shown that file duplication among GitHub projects, mainly targeting popular libraries copied into many projects, is a common phenomenon [41], which we also observed for the projects we analyzed (see Table 1). For example, many projects contained sqlite3.c, which corresponds to the database with the same name (which uses the rdtsc instruction), SDL_endian.h for the SDL library (which uses inline assembly for endianness conversions), and inffas86.c (which implements a compression algorithm using inline assembly).…”

Section: Rq22 (supporting)

confidence: 68%

An Analysis of x86-64 Inline Assembly in C Programs

Rigger

Marr

Kell

et al. 2018

Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

C codebases frequently embed nonportable and unstandardized elements such as inline assembly code. Such elements are not well understood, which poses a problem to tool developers who aspire to support C code. This paper investigates the use of x86-64 inline assembly in 1264 C projects from GitHub and combines qualitative and quantitative analyses to answer questions that tool authors may have. We found that 28.1% of the most popular projects contain inline assembly code, although the majority contain only a few fragments with just one or two instructions. The most popular instructions constitute a small subset concerned largely with multicore semantics, performance optimization, and hardware control. Our findings are intended to help developers of C-focused tools, those testing compilers, and language designers seeking to reduce the reliance on inline assembly. They may also aid the design of tools focused on inline assembly itself.

“…To filter duplicates, we used file names, directory names (such as "node_modules"), and md5 of files. In Java and Python, which do not commit dependencies, duplication is less severe (as also observed by Lopes et al [29]). Furthermore, in our setting, we took the top-ranked and most popular projects, in which we observed duplication to be less of a problem (Lopes et al [29] measured duplication across all the code in GitHub).…”

Section: Experimental Setting (mentioning)

confidence: 52%

“…Java required an order of magnitude more data than the other languages: we had to keep enlarging our Java dataset to achieve results that were close to the other languages. Following recent work which found a large amount of code duplication in GitHub [29], we devoted much effort into filtering duplicates from our dataset, and especially the JavaScript dataset. To filter duplicates, we used file names, directory names (such as "node_modules"), and md5 of files.…”

Section: Experimental Setting (mentioning)

confidence: 99%
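
The duplicate filtering described in the statement above (directory names such as "node_modules" plus md5 of files) might look roughly like the sketch below. It is not the citing authors' code: the function names, the `SKIP_DIRS` set, and the `skip_names` parameter are hypothetical, and the statement does not say which file names were used for filtering, so that set is left empty by default.

```python
import hashlib
import os

# Hypothetical skip list; the statement names only "node_modules".
SKIP_DIRS = {"node_modules"}


def file_md5(path, chunk_size=1 << 16):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(root, skip_dirs=SKIP_DIRS, skip_names=frozenset()):
    """Walk `root` and keep the first file seen for each md5 digest.

    Directories in `skip_dirs` are pruned entirely; files whose base
    name is in `skip_names` (an assumed, caller-supplied set) are dropped.
    """
    seen = set()
    kept = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            if name in skip_names:
                continue
            path = os.path.join(dirpath, name)
            digest = file_md5(path)
            if digest not in seen:
                seen.add(digest)
                kept.append(path)
    return kept
```

Hashing whole files only catches byte-identical copies; near-duplicates that differ by whitespace or comments would pass through this filter.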

A general path-based representation for predicting program properties

Alon

Zilberstein

Levy

et al. 2018

Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation

Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming, and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning. We present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens. We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.

“…We trained models at all three levels of import granularity (i.e., class, package and library-level) and could clearly observe that, for the task of clustering libraries based on semantic similarity, package- and library-level imports produced better non-trivial clusters. In this work we therefore focus on library-level imports.…”

Section: B Assessing the Quality Of The Trained Models (mentioning)

confidence: 98%

Import2vec: Learning Embeddings for Software Libraries

Theeten

Vandeputte

Cutsem

2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning. We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages ("library vectors"). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveal that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).
