Automated change-prone class prediction on unlabeled dataset using unsupervised method

Yan, Meng; Zhang, Xiaohong; Liu, Chao; Xu, Lan; Yang, Mengning; Yang, Dan

doi:10.1016/j.infsof.2017.07.003

Cited by 21 publications

(17 citation statements)

References 24 publications


“…To achieve this goal, we have constructed a defect prediction model by exploiting the unlabelled software datasets of Geant4 that is one of the most rigorously validated software packages for the simulation of the passage of particles through matter [18]. Amongst the different ML methodologies, we have selected CLAMI [13] and CLAMI+ [14] in order to label the instances in the software datasets. In addition, we have applied a large set of ML techniques to predict defect-prone modules.…”

Section: Methods (mentioning)

confidence: 99%

“…More in detail, CLAMI+ transforms the Boolean representation in CLAMI of metrics' violation into a probabilistic value based on the difference between the metric value and the threshold. Consequently, CLAMI+ considers how much an instance violated on a metric and leads to a different selection of the final training set that is expected to be more informative than that built by CLAMI [14].…”

Section: Methods (mentioning)

confidence: 99%
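
The contrast drawn in the statement above, a Boolean violation flag versus a value that grows with the distance from the threshold, can be illustrated with a minimal sketch. This is not the published CLAMI or CLAMI+ procedure: the per-metric median threshold, the range scaling, and the logistic mapping are assumptions chosen for illustration, and the function names `boolean_violations` and `graded_violations` are hypothetical.

```python
from math import exp
from statistics import median


def boolean_violations(rows):
    """Count, per instance, how many metrics exceed the per-metric median.

    Assumption: the threshold for each metric is its median over all instances.
    """
    columns = list(zip(*rows))
    thresholds = [median(col) for col in columns]
    return [sum(1 for v, t in zip(row, thresholds) if v > t) for row in rows]


def graded_violations(rows):
    """Sum, per instance, a value in (0, 1) for each metric.

    Assumption: the value is a logistic function of the difference between
    the metric value and the median, scaled by the metric's range. It equals
    0.5 at the threshold and grows as the metric exceeds it further.
    """
    columns = list(zip(*rows))
    thresholds = [median(col) for col in columns]
    # Fall back to 1.0 for a constant metric to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    scores = []
    for row in rows:
        score = 0.0
        for v, t, s in zip(row, thresholds, spans):
            score += 1.0 / (1.0 + exp(-(v - t) / s))
        scores.append(score)
    return scores


# Four instances, two metrics each (made-up values for illustration only).
rows = [[1, 10], [2, 20], [3, 30], [100, 40]]
flags = boolean_violations(rows)
graded = graded_violations(rows)
```

In this toy data the last two instances receive the same Boolean count, while the graded score separates them because the last instance exceeds both thresholds by a wider margin.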

“…CLAMI+ is an evolution of the CLAMI approach: it employs a different procedure in the metrics' selection phase. CLAMI+ is still dependent on thresholds, but it normalizes metrics' values [14].…”

mentioning

confidence: 99%


Lessons Learned from the Assessment of Software Defect Prediction on WLCG Software: A Study with Unlabelled Datasets and Machine Learning Techniques

et al. 2020


Software defect prediction is an activity that aims at narrowing down the most likely defect-prone software modules and helping developers and testers to prioritize inspection and testing. This activity can be addressed by using Machine Learning techniques applied to software metrics datasets that are usually unlabelled, i.e. they lack modules classification in terms of defectiveness. To overcome this limitation, in addition to the usual data pre-processing operations to manage missing values and/or to remove inconsistencies, researchers have to adopt an approach to label their unlabelled software datasets. The extraction of defectiveness data to label all the instances of the datasets is an extremely time- and effort-consuming operation. In the literature, many studies have introduced approaches to build defect prediction models on unlabelled datasets. In this paper, we describe the analysis of new unlabelled datasets from WLCG software, coming from HEP-related experiments and middleware, by using Machine Learning techniques. We have experimented with new approaches to label the various modules due to the heterogeneity of software metrics distribution. We discuss a number of lessons learned from conducting these activities, what has worked, what has not and how our research can be improved.


“…A supervised technique uses an already labelled dataset to train a classification algorithm. In an unsupervised approach, a dataset is labelled using certain heuristics such as distance measures to cluster related texts (Yan et al, 2017).…”

Section: Related Work (mentioning)

confidence: 99%

An improved text classification modelling approach to identify security messages in heterogeneous projects

Oyetoyan, Morrison. 2021. Software Qual J


Security remains under-addressed in many organisations, illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software’s design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model based on harvested security keywords as a feature set shows average recall from 55 to 86%, minimum recall from 43 to 71% and maximum recall from 60 to 100%. It also shows an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26–44% in recall, 22–50% in g-measure, 0.4–28% in f-score, and 15–19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
We find, using sound statistical and effect size tests, that (1) using harvested security keywords as features to train a text classification model improves classification models and generalises to other projects significantly; (2) including features in the training dataset before model construction improves classification models significantly; and (3) different security categories represent predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.


“…The popularity of DL models in SE is mainly due to the advantages of representation learning from raw data [79,95,134]. For example, in many recent SE studies, a large number of challenges derive from the semantic comprehension of code in programming languages [84,85,148,149,154], text in natural languages [34,150], or their mutual transformation [44]. As code and text involves some form of natural language processing (NLP), it commonly starts with encoding words by a fixed size of vocabulary [44].…”

Section: Background and Related Work, 2.1 DL Technology in SE (mentioning)

confidence: 99%

On the Replicability and Reproducibility of Deep Learning in Software Engineering

Liu, Gao, Xia et al. 2020. Preprint

Self Cite


Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years. This is because they can often solve many SE challenges without enormous manual feature engineering effort and complex domain knowledge. Although many DL studies have reported substantial advantages over other state-of-the-art models on effectiveness, they often ignore two factors: (1) replicability, i.e., whether the reported experimental result can be approximately reproduced with high probability with the same DL model and the same data; and (2) reproducibility, i.e., whether reported experimental findings can be reproduced by new experiments with the same experimental protocol and DL model, but different sampled real-world data. Unlike traditional machine learning (ML) models, DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity with many manually set parameters and the time-consuming optimization process. In this study, we conducted a literature review on 93 DL studies recently published in twenty SE journals or conferences. Our statistics show the urgency of investigating these two factors in SE, where only 10.8% of the studies discussed any research questions affecting replicability and/or reproducibility. More than 74.2% of the studies do not even share source code and data to support the replicability of their complex models. Moreover, we re-ran four representative DL models in SE. Experimental results show the importance of replicability and reproducibility, where the reported performance of a DL model could not be replicated because of an unstable optimization process. Reproducibility could be substantially compromised if the model training is not convergent, or if performance is sensitive to the size of vocabulary and testing data.
It is therefore urgent for the SE community to provide a long-lasting link to a replication package, enhance DL-based solution stability and convergence, and avoid performance sensitivity on different sampled data.

CCS Concepts: • Software and its engineering → Software maintenance tools.
