2019
DOI: 10.1007/s10664-019-09775-w

SIEVE: Helping developers sift wheat from chaff via cross-platform analysis

Abstract: Software developers have benefited from various sources of knowledge such as forums, question-and-answer sites, and social media platforms to help them in various tasks. Extracting software-related knowledge from different platforms involves many challenges. In this paper, we propose an approach to improve the effectiveness of knowledge extraction tasks by performing cross-platform analysis. Our approach is based on transfer representation learning and word embeddings, leveraging information extracted from a so…

Cited by 6 publications (4 citation statements)
References 86 publications
“…There are no statistically significant differences among FastText and Word2vec models in terms of MCC. Our results confirm previous work (Sulistya et al 2020 ; Mikolov et al 2013 ) which assessed the superiority of Word2vec and FastText in a different context (i.e., text mining). In addition, our work agrees with the findings of previous work (Lau and Baldwin 2016 ) suggesting that Doc2vec creates document embeddings which align with lower frequency words when the documents are short and the corpus is relatively small.…”
Section: Results of the Empirical Study (supporting)
confidence: 92%
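The MCC referenced in the statement above is the Matthews correlation coefficient, a standard measure for binary classifiers such as those comparing embedding models. As an illustrative aside (the confusion-matrix counts below are invented for the example, not taken from the cited study), it is computed directly from the four cells of a binary confusion matrix:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical classifier for software-relevant posts:
# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
print(round(mcc(40, 45, 5, 10), 3))  # → 0.704
```

MCC ranges from -1 to +1 and, unlike accuracy, stays informative when the two classes are imbalanced, which is why it is often preferred for comparing models in such evaluations.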
“…In a recent study, Sulistya et al ( 2020 ) compared different word embedding learning methods for finding software-relevant tweets. Following their guidelines, we used the same hyper-parameter settings for each word embedding learning model (i.e., Word2vec , Doc2Vec , and FastText ).…”
Section: Empirical Study Definition and Design (mentioning)
confidence: 99%