Abstract: This paper considers the text classification problem for natural language call routing. Seven different term weighting methods were applied. For dimensionality reduction, feature selection based on a self-adaptive genetic algorithm (GA) was considered. k-NN, linear SVM, and ANN were used as classification algorithms. The goals of this research are to study text classification for natural language call routing with different term weighting methods and classification algorithms, and to investigate the feature selection method based on the self-adaptive GA. The numerical results showed that the most effective term weighting method is TRR and the most effective classification algorithm is the ANN. Feature selection with the self-adaptive GA improves classification effectiveness and provides significant dimensionality reduction for all term weighting methods and all classification algorithms.
Introduction

Natural language call routing is an important problem in the design of modern automatic call services, and solving it could improve the quality of such services [21]. In general, natural language call routing can be considered as two different problems: the first is speech recognition of calls, and the second is topic categorization of users' utterances for further routing. Topic categorization of users' utterances can also be useful for multi-domain spoken dialogue system design [12]. In this work we treat call routing as an example of a text classification application.

In the vector space model [16], text classification is considered a machine learning problem. The complexity of text categorization with a vector space model is compounded by the need to extract numerical data from the text before applying machine learning algorithms. Therefore, text classification consists of two parts: text preprocessing and application of a classification algorithm to the obtained numerical data. Text preprocessing comprises three stages:
- textual feature extraction;
- term weighting;
- dimensionality reduction.

The first stage is textual feature extraction based on raw preprocessing of the documents. This process includes deleting punctuation, converting capital letters to lowercase, and additional procedures such as stop-word filtering [4] and stemming [14]. A stop-word list contains pronouns, prepositions, articles, and other words that usually have no importance for the classification. Using stemming, it is possible to merge different forms of the same word into one textual feature.

The second stage is numerical feature extraction based on term weighting. For term weighting we use the "bag-of-words" model, in which word order is ignored. There exist different unsupervised and supervised term weighting methods; the most well-known unsupervised term weighting method is TFIDF [15].
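The two preprocessing stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny stop-word list, the toy call corpus, and the naive suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's) are all assumptions, and the TFIDF variant shown (raw term frequency times log inverse document frequency) is one common formulation.

```python
import math
import re
from collections import Counter

# Illustrative stop-word subset (assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "my", "i", "me"}

def stem(word):
    # Naive suffix stripping as a stand-in for a real stemming algorithm
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_features(text):
    # Stage 1: raw preprocessing -- lowercase, drop punctuation,
    # filter stop-words, stem the remaining tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def tfidf(corpus):
    # Stage 2: bag-of-words term weighting with TFIDF;
    # word order is ignored, only per-document term counts matter
    docs = [Counter(extract_features(text)) for text in corpus]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    return [
        {term: tf * math.log(n / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]

# Hypothetical user utterances from a call-routing scenario
calls = [
    "I want to check my account balance",
    "checking the balance of the account",
    "please connect me to a support operator",
]
weights = tfidf(calls)
```

Note how the unsupervised TFIDF weighting needs no class labels: a term occurring in every document gets weight log(n/n) = 0, while a term unique to one document gets the largest inverse-document-frequency factor.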
The following supervised term weighting methods are also considered in the paper: