Abstract: Accurate project effort prediction is an important goal for the software engineering community. To date, most work has focused upon building algorithmic models of effort, for example COCOMO. These can be calibrated to local environments. We describe an alternative approach to estimation based upon the use of analogies. The underlying principle is to characterize projects in terms of features (for example, the number of interfaces, the development method or the size of the functional requirements document). Completed projects are stored and then the problem becomes one of finding the most similar projects to the one for which a prediction is required. Similarity is defined as Euclidean distance in n-dimensional space where n is the number of project features. Each dimension is standardized so all dimensions have equal weight. The known effort values of the nearest neighbors to the new project are then used as the basis for the prediction. The process is automated using a PC-based tool known as ANGEL. The method is validated on nine different industrial datasets (a total of 275 projects) and in all cases analogy outperforms algorithmic models based upon stepwise regression. From this work we argue that estimation by analogy is a viable technique that, at the very least, can be used by project managers to complement current estimation techniques.
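The analogy procedure described above (standardize each feature, measure Euclidean distance, predict from the nearest neighbors' known effort) can be illustrated with a minimal Python sketch. It is not the ANGEL implementation. The abstract does not say how dimensions are standardized or how neighbor efforts are combined, so min-max scaling to [0, 1] and an unweighted mean of the k nearest projects are assumptions made here, as are the function names and the sample data.

```python
import math


def standardize(rows):
    """Return scaled rows plus the scaling function.

    Assumption: min-max scaling to [0, 1] per feature, so that every
    dimension carries equal weight in the distance calculation.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column has zero range; fall back to 1.0 to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]

    def scale(row):
        return [(v - lo) / s for v, lo, s in zip(row, lows, spans)]

    return [scale(r) for r in rows], scale


def estimate_by_analogy(completed, efforts, target, k=2):
    """Predict effort for `target` from its k most similar completed projects.

    Assumption: the prediction is the unweighted mean of the neighbors' effort.
    """
    scaled, scale = standardize(completed)
    t = scale(target)
    # Euclidean distance in n-dimensional feature space.
    distances = [math.dist(row, t) for row in scaled]
    nearest = sorted(range(len(scaled)), key=distances.__getitem__)[:k]
    return sum(efforts[i] for i in nearest) / len(nearest)


# Hypothetical data: [number of interfaces, size of requirements document].
completed = [[10, 100], [20, 200], [30, 300], [40, 400]]
efforts = [100, 200, 300, 400]  # known effort of each completed project
prediction = estimate_by_analogy(completed, efforts, [22, 210], k=2)
```

Categorical features such as the development method would need their own distance treatment, which this sketch leaves out.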
This paper aims to provide a basis for the improvement of software estimation research through a systematic review of previous work. The review identifies 304 software cost estimation papers in 76 journals and classifies the papers according to research topic, estimation approach, research approach, study context and data set. Based on the review, we provide recommendations for future software cost estimation research: 1) increase the breadth of the search for relevant studies, 2) search manually for relevant papers within a carefully selected set of journals when completeness is essential, 3) conduct more research on basic software cost estimation topics, 4) conduct more studies of software cost estimation in real-life settings, 5) conduct more studies on estimation methods commonly used by the software industry, and 6) conduct fewer studies that evaluate methods based on arbitrarily chosen data sets.
BACKGROUND: Self-evidently, empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and using the same rather than similar versions of data sets. In recent years there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research.
OBJECTIVE: This short note investigates the extent to which published analyses based on the NASA defect data sets are meaningful and comparable.
METHOD: We analyse the five studies published in IEEE Transactions on Software Engineering since 2007 that have utilised these data sets and compare the two versions of the data sets currently in use.
RESULTS: We find important differences between the two versions of the data sets, implausible values in one data set and generally insufficient detail documented on data set pre-processing.
CONCLUSIONS: It is recommended that researchers (i) indicate the provenance of the data sets they use, (ii) report any pre-processing in sufficient detail to enable meaningful replication and (iii) invest effort in understanding the data prior to applying machine learners.