The performance of classification models can suffer when the data on which they are trained contains very rare events. While recent research has investigated class imbalance, few if any studies address extreme imbalance (rare events), where the minority class can account for as little as 0.1% of the training data. This work investigates the effect of dataset size and class distribution on classification performance when examples from the minority class are rare. In addition, we compare the performance improvement achieved by acquiring additional examples with that achieved by applying data sampling. Our results demonstrate that data sampling is very effective at alleviating the problem of rare events.
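The abstract above does not say which data sampling techniques were evaluated. As a rough illustration only, the following sketch shows random undersampling, one common form of data sampling, on toy data with a 0.1% minority class. The function name `random_undersample`, its `ratio` parameter, and the toy data are assumptions made for this example and do not come from the paper.

```python
import random
from collections import Counter

def random_undersample(examples, labels, minority_label, ratio=1.0, seed=0):
    """Keep every minority example and a random subset of the majority.

    ratio is the desired majority-to-minority count after sampling
    (hypothetical parameter, chosen for this sketch).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Never ask for more majority examples than actually exist.
    keep = min(len(majority), int(round(ratio * len(minority))))
    chosen = minority + rng.sample(majority, keep)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]

# Toy data: 10 positives among 10,000 examples (0.1% minority class).
labels = [1 if i % 1000 == 0 else 0 for i in range(10000)]
examples = list(range(10000))

X_bal, y_bal = random_undersample(examples, labels, minority_label=1)
counts = Counter(y_bal)
print(counts[1], counts[0])
```

With `ratio=1.0` the sampled set is balanced. Undersampling discards majority examples, so it trades training set size for class balance, which is the kind of trade-off the abstract compares against acquiring more data.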
The dynamic decay adjustment (DDA) algorithm is a fast constructive algorithm for training RBF neural networks (RBFNs) and probabilistic neural networks (PNNs). The algorithm has two parameters, namely, theta(+) and theta(-). The papers that introduced DDA argued that these parameters would not heavily influence classification performance and therefore recommended always using their default values. In contrast, this paper shows that smaller values of theta(-) can, for a considerable number of datasets, result in a strong improvement in generalization performance. The experiments described here were carried out using twenty benchmark classification datasets from the Proben1 and UCI repositories. The results show that for eleven of the datasets, theta(-) strongly influenced classification performance. The influence of theta(-) was also noticeable, although much weaker, on six of the remaining datasets. This paper also compares the performance of RBF-DDA with theta(-) selection against both AdaBoost and Support Vector Machines (SVMs).
Software reuse is essential for improving the productivity and quality of software projects. One of the key issues in promoting the adoption of software reuse in companies is the development of effective repositories of software components. It is also very important to have good methods for searching for and retrieving the components. Clustering techniques can help by providing a visualization of the repository of software components and by helping to refine searches through grouping similar components together. In this paper we quantitatively compare two clustering techniques, namely, self-organizing maps (SOM) and growing hierarchical SOM (GHSOM), for clustering a repository of classes from a Java API for building mobile systems. The performance measure was the quantization error. The simulations have shown that GHSOM outperforms SOM in this task. GHSOM is more suitable for this task because it is a constructive technique, which is an advantage in tackling the growth of the repository of software components.
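The abstract names quantization error as the performance measure without defining it. A common definition for SOM-type maps, assumed here, is the mean Euclidean distance between each input vector and the weight vector of its best-matching unit. The sketch below computes that quantity; the function name, the toy vectors, and the two-unit codebook are illustrative assumptions, not data from the paper.

```python
import math

def quantization_error(data, codebook):
    """Mean Euclidean distance from each input to its best-matching unit.

    data:     list of input vectors
    codebook: list of unit weight vectors (same dimensionality)
    """
    total = 0.0
    for x in data:
        # The best-matching unit is the codebook vector closest to x.
        total += min(math.dist(x, w) for w in codebook)
    return total / len(data)

# Toy example: three 2-D inputs and a map with two units.
data = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
codebook = [(0.0, 0.0), (4.0, 0.0)]

qe = quantization_error(data, codebook)
print(round(qe, 4))  # distances 0, 1 and 3 average to 1.3333
```

A lower value means the units represent the inputs more closely. Some GHSOM descriptions use a summed rather than averaged error per unit, so the exact variant used in the paper may differ from this sketch.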