Abstract. This paper describes several known and some new methods for feature subset selection on large text data. An experimental comparison on real-world data collected from Web users shows that the characteristics of the problem domain and of the machine learning algorithm should be considered when a feature scoring measure is selected. Our problem domain consists of hyperlinks given in the form of small documents represented as word vectors. In our learning experiments a naive Bayesian classifier was used on text data. The best performance was achieved by the feature selection methods based on the feature scoring measure called Odds ratio, which is known from information retrieval.
1 Introduction

In propositional learning, a problem domain is given by a set of examples, where each example is described by a class value and a vector of feature values. The features used to describe examples are not necessarily all relevant and beneficial for inductive learning, and they may reduce the quality of the induced model. Additionally, a high number of features may slow down the induction process while giving results similar to those obtained with a much smaller feature subset. Section 2 describes the approach commonly used for feature subset selection in learning on text data (text-learning). Section 3 describes our problem domain and the naive Bayesian classifier for text that we used in the experiments. In Section 4 we experimentally compare different feature scoring measures on real-world data collected from Web users. Discussion is given in Section 5.
2 Feature subset selection approaches

Different methods have been developed and used for feature subset selection in statistics, pattern recognition and machine learning, using different search strategies and evaluation functions. John et al. [4] pointed out the difference between the two main approaches to feature subset selection used in machine learning: the filtering approach, where the feature subset is selected independently of the learning method, and the wrapper approach, where the feature subset is selected using the same learning algorithm that will be used for learning on the domain represented with the selected feature subset.
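The contrast between the two approaches can be illustrated with a minimal sketch. It is not the implementation used in this paper. The filter variant scores each word with the Odds ratio in its commonly cited log form, log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos))P(W|neg))]. The wrapper variant performs greedy forward selection driven by a caller-supplied evaluation function. The function names, the Laplace smoothing, the toy documents and the simple rule used as a stand-in learner are all assumptions made for illustration.

```python
import math


def odds_ratio(p_pos, p_neg):
    """Log odds ratio of a word, given P(W|pos) and P(W|neg)."""
    return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))


def filter_select(docs, labels, k):
    """Filter approach: score words without consulting any learner, keep top k."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    vocab = set().union(*docs)
    scores = {}
    for w in vocab:
        df_pos = sum(1 for d, y in zip(docs, labels) if y and w in d)
        df_neg = sum(1 for d, y in zip(docs, labels) if not y and w in d)
        # Laplace smoothing (an assumption) keeps probabilities away from 0 and 1.
        p_pos = (df_pos + 1) / (n_pos + 2)
        p_neg = (df_neg + 1) / (n_neg + 2)
        scores[w] = odds_ratio(p_pos, p_neg)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-scores[w], w))[:k]


def wrapper_select(docs, labels, k, evaluate):
    """Wrapper approach: greedy forward selection scored by the learner itself."""
    selected = []
    remaining = sorted(set().union(*docs))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda w: evaluate(selected + [w], docs, labels))
        selected.append(best)
        remaining.remove(best)
    return selected


def rule_accuracy(features, docs, labels):
    """Hypothetical stand-in learner: predict positive if any feature occurs."""
    hits = sum(int(any(f in d for f in features)) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Toy data (invented): each document is a set of words, label 1 = positive.
docs = [{"python", "code"}, {"python", "tutorial"},
        {"football", "score"}, {"football", "code"}]
labels = [1, 1, 0, 0]

print(filter_select(docs, labels, 2))
print(wrapper_select(docs, labels, 1, rule_accuracy))
```

The filter variant makes one pass over the vocabulary and never trains a model, which is why it scales to the large feature sets typical of text data. The wrapper variant calls the evaluation function once per candidate feature in every round, so its cost grows with both the vocabulary size and the number of features selected.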