Data plays a key role in the design of expert and intelligent systems and therefore, data preprocessing appears to be a critical step to produce high-quality data and build accurate machine learning models. Over the past decades, increasing attention has been paid towards the issue of class imbalance and this is now a research hotspot in a variety of fields. Although the resampling methods, either by undersampling the majority class or by over-sampling the minority class, stand among the most powerful techniques to face this problem, their strengths and weaknesses have typically been discussed based only on the class imbalance ratio. However, several questions remain open and need further exploration. For instance, the subtle differences in performance between the over-and under-sampling algorithms are still under-comprehended, and we hypothesize that they could be better explained by analyzing the inner structure of the data sets. Consequently, this paper attempts to investigate and illustrate the effects of the resampling methods on the inner structure of a data set by exploiting local neighborhood information, identifying the sample types in both classes and analyzing their distribution in each resampled set. Experimental results indicate that the resampling methods that pro
‘El Diario de Juárez’ is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants that publishes texts of which citizens read them on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news provided by the RSS service in order to classify them based on whether they are about a traffic incident or not, with the final intention of notifying citizens where such accidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, kNN, Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; this challenging issue is dealt with via five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. Consequently, our final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance when considering the complexity of analyzing unstructured texts in Spanish.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.