Text categorization plays a crucial role in both academic and commercial platforms due to the growing demand for automatic organization of documents. Kernel-based classification algorithms such as Support Vector Machines (SVM) have become highly popular for text mining. This is mainly due to their relatively high classification accuracy across several application domains, as well as their ability to handle the high-dimensional and sparse data that is characteristic of textual representations. Recently, there has been increased interest in exploiting background knowledge, such as ontologies and corpus-based statistical knowledge, in text categorization. It has been shown that by replacing standard kernel functions such as the linear kernel with customized kernel functions that take advantage of this background knowledge, it is possible to increase the performance of SVM in the text classification domain. Based on this, we propose a novel semantic smoothing kernel for SVM. The suggested approach is based on a meaning measure, which calculates the meaningfulness of terms in the context of classes. The document vectors are smoothed based on these meaning values of the terms in the context of classes. Since we make efficient use of the class information in the smoothing process, it can be considered a supervised smoothing kernel. The meaning measure is based on the Helmholtz principle from Gestalt theory and has previously been applied to several text mining applications such as document summarization and feature extraction. To the best of our knowledge, however, ours is the first study to use the meaning measure in a supervised setting to build a semantic kernel for SVM. We evaluated the proposed approach through a large number of experiments on well-known textual datasets and present results under different experimental conditions.
We compare our results with traditional kernels used in SVM, such as the linear kernel, as well as with several corpus-based semantic kernels. Our results show that the proposed approach outperforms these kernels in classification performance.
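The class-based smoothing described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the term-by-class `meaning` matrix is a placeholder (the paper derives it from the Helmholtz principle), and the smoothing and normalization choices here are illustrative assumptions.

```python
import numpy as np

def smoothing_kernel(X, meaning):
    """X: (n_docs, n_terms) document-term matrix.
    meaning: (n_terms, n_classes) meaningfulness of each term per class
    (placeholder for the Helmholtz-principle-based measure)."""
    # Term-by-term smoothing matrix: two terms become related when they
    # are meaningful in the same classes (supervised, via class context).
    S = meaning @ meaning.T                 # (n_terms, n_terms)
    S /= np.maximum(S.max(), 1e-12)         # scale relations to [0, 1]
    Xs = X @ S                              # smoothed document vectors
    return Xs @ Xs.T                        # Gram matrix, valid SVM kernel

# Toy example: term 2 is meaningful in both classes, so it links the docs.
X = np.array([[1., 0., 1.], [0., 1., 1.]])
M = np.array([[1., 0.], [0., 1.], [0.5, 0.5]])
K = smoothing_kernel(X, M)
```

Because the kernel is computed as an inner product of smoothed vectors, it is guaranteed to be positive semi-definite and can be plugged into any kernelized SVM implementation directly.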
Ganiz, Murat Can (Dogus Author) -- Conference full title: 2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2014): Alberobello, Italy, 23-25 June 2014. The bag of words (BOW) representation of documents is very common in text classification systems. However, the BOW approach ignores the position of words in a document and, more importantly, the semantic relations between the words. In this study, we present a simple semantic kernel for the Support Vector Machines (SVM) algorithm. This kernel uses higher-order relations between terms in order to incorporate semantic information into the SVM. The algorithm is easy to implement and forms a basis for future improvements. We perform a series of experiments on several well-known textual datasets. The results show that classification performance improves over traditional kernels used in SVM, such as the linear kernel commonly used in text classification.
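A minimal sketch of how higher-order term relations can feed an SVM kernel, under stated assumptions: the `decay` weight and the one-step expansion below are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def higher_order_term_kernel(X, decay=0.5):
    """X: (n_docs, n_terms) binary document-term matrix."""
    C = X.T @ X                       # first-order: terms in the same doc
    np.fill_diagonal(C, 0.0)
    # Second-order: two terms related through a shared intermediate term,
    # down-weighted by `decay` (an assumed hyperparameter).
    R = C + decay * (C @ C)
    R /= np.maximum(R.max(), 1e-12)   # scale relations to [0, 1]
    np.fill_diagonal(R, 1.0)          # each term is fully related to itself
    Xh = X @ R                        # enrich docs with related terms
    return Xh @ Xh.T                  # Gram matrix, valid SVM kernel

# Terms 0 and 2 never co-occur directly, yet both co-occur with term 1,
# so the second-order relation R[0, 2] becomes nonzero.
X = np.array([[1., 1., 0.], [0., 1., 1.]])
K = higher_order_term_kernel(X)
```

The resulting matrix is a Gram matrix of transformed document vectors, so positive semi-definiteness holds regardless of the relation weights chosen.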
Çakırman, Erhan (Dogus Author) -- Ganiz, Murat C. (Dogus Author) -- Akyokuş, Selim (Dogus Author) -- Gürbüz, Mustafa Z. (Dogus Author) -- Conference full title: 2011 International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2011): Istanbul, Turkey, 15-18 June 2011. Preprocessing is an important task and a critical step in information retrieval and text mining. The objective of this study is to analyze the effect of preprocessing methods on the classification of Turkish texts. We compiled two large datasets from Turkish newspapers using a crawler. Using these compiled datasets together with two additional ones, we perform a detailed analysis of preprocessing methods such as stemming, stopword filtering, and word weighting for Turkish text classification, and report the results of extensive experiments.
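A toy sketch of the preprocessing steps named above (stemming, stopword filtering, word weighting). Assumptions are marked in the code: the stopword list is a tiny illustrative sample, and fixed-prefix truncation stands in for a real Turkish morphological stemmer; the weighting shown is plain TF-IDF.

```python
import math
from collections import Counter

STOPWORDS = {"ve", "bir", "bu", "da", "de"}  # tiny illustrative sample

def preprocess(text, stem_len=5):
    # Fixed-prefix ("first-N-characters") stemming is a crude stand-in
    # for a proper Turkish morphological stemmer.
    tokens = [t.lower().strip(".,!?") for t in text.split()]
    return [t[:stem_len] for t in tokens if t and t not in STOPWORDS]

def tfidf(docs):
    tokenized = [preprocess(d) for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in tokenized]

docs = ["ekonomi haberleri ve piyasalar", "spor haberleri bu hafta"]
weights = tfidf(docs)
```

Note how a stem shared by all documents ("haber") receives zero IDF weight, while class-discriminative stems keep positive weight; the study's experiments compare exactly such weighting and filtering choices.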
Ganiz, Murat Can (Dogus Author) -- Conference full title: 2013 10th International Conference on Electronics, Computer and Computation (ICECCO 2013): Ankara, Turkey, 7-8 November 2013. In conventional text categorization algorithms, documents are represented as a “bag of words” (BOW), under the assumption that documents are independent of each other. While this approach simplifies the models, it ignores the semantic information between the terms of each document. In this study, we develop a novel method to measure semantic similarity based on higher-order dependencies between documents. Using these dependencies, we propose a kernel for the Support Vector Machines (SVM) algorithm called the Higher-Order Semantic Kernel. To assess its comparative performance, we conducted extensive experiments with both our kernel and traditional first-order kernels such as the Polynomial, Radial Basis Function, and Linear kernels. The experiments on several well-known datasets show that classification performance improves significantly over the first-order methods.
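In contrast to term-level relations, the dependencies here are between documents. A rough sketch of the idea, assuming cosine-normalized first-order similarity and a simple one-step chaining rule (both illustrative simplifications of the paper's construction):

```python
import numpy as np

def higher_order_doc_kernel(X):
    """X: (n_docs, n_terms) document-term matrix."""
    S = X @ X.T                             # first-order: shared terms
    d = np.sqrt(np.maximum(np.diag(S), 1e-12))
    S = S / np.outer(d, d)                  # cosine-normalize
    # Second-order: documents related through intermediate documents.
    # S is PSD and so is S @ S, hence their sum is a valid SVM kernel.
    return S + S @ S

# Docs 0 and 2 share no terms, but both overlap with doc 1, so the
# higher-order kernel assigns them a nonzero similarity.
X = np.array([[1., 1., 0.], [0., 1., 1.], [0., 0., 1.]])
K = higher_order_doc_kernel(X)
```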
Traditional machine learning methods only consider relationships between feature values within individual data instances while disregarding the dependencies that link features across instances. In this work, we develop a general approach to supervised learning by leveraging higher-order dependencies between features. We introduce a novel Bayesian framework for classification named Higher Order Naive Bayes (HONB). Unlike approaches that assume data instances are independent, HONB leverages co-occurrence relations between feature values across different instances. Additionally, we generalize our framework by developing a novel data-driven space transformation that allows any classifier operating in vector spaces to take advantage of these higher-order co-occurrence relations. Results obtained on several benchmark text corpora demonstrate that higher-order approaches achieve significant improvements in classification accuracy over the baseline (first-order) methods.
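The data-driven space transformation mentioned above can be sketched as follows. This is a simplified stand-in under a stated assumption: the paper counts higher-order paths between feature values across instances, whereas the sketch uses only one-step binary co-occurrence as the transformation matrix.

```python
import numpy as np

def higher_order_transform(X):
    """Map instances into a space enriched with cross-instance
    feature co-occurrence, usable by any vector-space classifier."""
    B = (X.T @ X > 0).astype(float)   # features that co-occur somewhere
    np.fill_diagonal(B, 1.0)          # keep each feature itself
    Xh = X @ B                        # instances "see" co-occurring features
    norms = np.linalg.norm(Xh, axis=1, keepdims=True)
    return Xh / np.maximum(norms, 1e-12)

# Feature 2 is absent from instance 0, but it co-occurs with feature 1
# in instance 1, so the transformed instance 0 gains a nonzero value there.
X = np.array([[1., 1., 0.], [0., 1., 1.]])
Xh = higher_order_transform(X)
```

Because the output is an ordinary feature matrix, it can be fed to any vector-space learner (SVM, logistic regression, k-NN), which is the point of generalizing beyond HONB itself.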