Context: Due to the complex nature of the software development process, traditional parametric models and statistical methods often appear to be inadequate to model the increasingly complicated relationship between project development cost and the project features (or cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Data preprocessing has been claimed by many researchers as a fundamental stage of ML methods; however, very few works have been focused on the effects of data preprocessing techniques. Objective: This study aims for a systematic assessment of the effectiveness of data preprocessing techniques on ML methods in the context of software cost estimation. Method: In this work, we first conduct a literature survey of the recent publications using data preprocessing techniques, followed by a systematic empirical study to analyze the strengths and weaknesses of individual data preprocessing techniques as well as their combinations. Results: Our results indicate that data preprocessing techniques may significantly influence the final prediction. They sometimes might have negative impacts on prediction performance of ML methods. Conclusion: In order to reduce prediction errors and improve efficiency, a careful selection is necessary according to the characteristics of machine learning methods, as well as the datasets used for software cost estimation.
International audienceThe current and future developments of electric power systems are pushing the boundaries of reliability assessment to consider distribution networks with renewable generators. Given the stochastic features of these elements, most modeling approaches rely on Monte Carlo simulation. The computational costs associated to the simulation approach force to treating mostly small-sized systems, i.e. with a limited number of lumped components of a given renewable technology (e.g. wind or solar, etc.) whose behavior is described by a binary state, working or failed. In this paper, we propose an analytical multi-state modeling approach for the reliability assessment of distributed generation (DG). The approach allows looking to a number of diverse energy generation technologies distributed on the system. Multiple states are used to describe the randomness in the generation units, due to the stochastic nature of the generation sources and of the mechanical degradation/failure behavior of the generation systems. The universal generating function (UGF) technique is used for the individual component multi-state modeling. A multiplication-type composition operator is introduced to combine the UGFs for the mechanical degradation and renewable generation source states into the UGF of the renewable generator power output. The overall multi-state DG system UGF is then constructed and classical reliability indices (e.g. loss of load expectation (LOLE), expected energy not supplied (EENS)) are computed from the DG system generation and load UGFs. An application of the model is shown on a DG system adapted from the IEEE 34 nodes distribution test feeder
Cost estimation is one of the most important but most difficult tasks in software project management. Many methods have been proposed for software cost estimation. Analogy Based Estimation (ABE), which is essentially a case-based reasoning (CBR) approach, is one popular technique. To improve the accuracy of ABE method, several studies have been focusing on the adjustments to the original solutions. However, most published adjustment mechanisms are based on linear forms and are restricted to numerical type of project features. On the other hand, software project datasets often exhibit nonnormal characteristics with large proportions of categorical features. To explore the possibilities for a better adjustment mechanism, this paper proposes Artificial Neural Network (ANN) for Non-linear adjustment to ABE (NABE) with the learning ability to approximate complex relationships and incorporating the categorical features. The proposed NABE is validated on four real world datasets and compared against the linear adjusted ABEs, CART, ANN and SWR. Subsequently, eight artificial datasets are generated for a systematic investigation on the relationship between model accuracies and dataset properties. The comparisons and analysis show that non-linear adjustment could generally extend ABE's flexibility on complex datasets with large number of categorical features and improve the accuracies of adjustment techniques.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.