Proceedings of the 2008 ACM Symposium on Applied Computing
DOI: 10.1145/1363686.1363896

Using ambiguity measure feature selection algorithm for support vector machine classifier

Abstract: With the ever-increasing number of documents on the web, in digital libraries, news sources, etc., the need for a text classifier that can classify massive amounts of data is becoming ever more critical, and the task more difficult. The major problem in text classification is the high dimensionality of the feature space. The Support Vector Machine (SVM) classifier has been shown to perform consistently better than other text classification algorithms. However, training an SVM model takes more time than other algorithms. We explore the…

Cited by 11 publications (7 citation statements)
References 15 publications
“…Many well‐known feature‐selection algorithms are used with SVM to improve its accuracy and efficiency. We use the AM feature‐selection method as a preprocessing step for the support vector machine classifier (Mengle & Goharian, 2008). The features whose AM scores are below a given threshold, i.e., more ambiguous terms, are purged while the features whose AM scores are above a given threshold are used for the SVM learning phase.…”
Section: Introduction
confidence: 99%
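As an illustration only, here is a minimal sketch of that preprocessing step. The AM formula follows Mengle and Goharian's definition as summarized in these statements (the fraction of a term's occurrences that fall in a single category, maximized over categories); the function names, matrix layout, and the 0.9 threshold are assumptions, not taken from the paper.

```python
# Hypothetical sketch of AM-based feature selection before SVM training.
# Assumes AM(t, c) = tf(t, c) / tf(t) and AM(t) = max over categories c;
# names and the 0.9 threshold are illustrative, not the paper's values.
import numpy as np
from sklearn.svm import LinearSVC

def ambiguity_scores(tf: np.ndarray) -> np.ndarray:
    """tf: (n_categories, n_terms) matrix of term frequencies.
    Returns AM(t) = max_c tf(t, c) / tf(t) for every term."""
    totals = tf.sum(axis=0)                        # tf(t) across all categories
    with np.errstate(divide="ignore", invalid="ignore"):
        am = np.where(totals > 0, tf / totals, 0.0)
    return am.max(axis=0)                          # best single-category fit

def select_features(X: np.ndarray, tf: np.ndarray, threshold: float = 0.9):
    """Purge terms whose AM score falls below the threshold; keep the rest."""
    keep = ambiguity_scores(tf) >= threshold
    return X[:, keep], keep

# Usage: X is a (documents x terms) matrix, y the category labels, and tf
# the per-category term-frequency matrix built from the training set.
# X_reduced, mask = select_features(X, tf)
# clf = LinearSVC().fit(X_reduced, y)
```

Only the surviving columns of the document-term matrix reach the SVM learning phase, which is what shrinks the training time the abstract alludes to.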
“…Using the gradient descent method, we can iteratively update the centroid feature vectors stochastically to obtain the best centroid via formulas (16) and (17), as follows:…”
Section: Smoothing Listwise Ranking Centroid Methods
confidence: 99%
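Formulas (16) and (17) are not reproduced in the snippet. Purely as a generic illustration of a stochastic centroid update, not the citing paper's actual rule, a sketch under an assumed squared-distance loss might look like this:

```python
# Generic stochastic-gradient centroid update; the real update rules are
# the citing paper's formulas (16)-(17), which are not shown here. The
# squared-distance loss and the learning rate are assumptions.
import numpy as np

def sgd_centroid_update(centroid, x, lr=0.01):
    """One stochastic step minimizing ||x - centroid||^2 for a sampled
    document vector x; the gradient w.r.t. the centroid is 2*(centroid - x)."""
    return centroid - lr * 2.0 * (centroid - x)

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 5))          # toy document vectors
c = np.zeros(5)
for _ in range(20):                       # epochs
    for x in rng.permutation(docs):       # shuffle, then one step per doc
        c = sgd_centroid_update(c, x)
# c now approximates docs.mean(axis=0), the minimizer of the summed loss
```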
“…Furthermore, the naïve Bayes classifier trains in linear time, unlike SVM. We improved the effectiveness of the model by using two feature‐selection algorithms, namely odds ratio (Mladenić & Grobelnik, 1998) and ambiguity measure (AM), which was shown to outperform existing feature‐selection algorithms (Mengle & Goharian, 2008b). We evaluated the effectiveness of these feature‐selection algorithms on unbalanced datasets and observed that AM is better suited for such tasks.…”
Section: Methods
confidence: 99%
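The odds-ratio score referenced above has a standard form, OR(t, c) = log[P(t|c)(1 − P(t|¬c)) / ((1 − P(t|c)) P(t|¬c))]. The sketch below computes it from document frequencies; the smoothing constant and names are illustrative assumptions, not Mladenić and Grobelnik's exact formulation.

```python
# Hedged sketch of odds-ratio feature scoring (after Mladenic & Grobelnik,
# 1998): OR(t, c) = log[ P(t|c)(1 - P(t|~c)) / ((1 - P(t|c)) P(t|~c)) ].
import numpy as np

def odds_ratio(df_pos, n_pos, df_neg, n_neg, eps=0.5):
    """df_pos: docs in category c containing the term; df_neg: docs outside
    c containing it. Smoothing with eps keeps the log finite for terms
    absent on one side (an assumed, Laplace-style choice)."""
    p = (df_pos + eps) / (n_pos + 2 * eps)    # ~ P(t|c)
    q = (df_neg + eps) / (n_neg + 2 * eps)    # ~ P(t|~c)
    return np.log(p * (1 - q) / ((1 - p) * q))
```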
“…Ambiguity measure (AM; Mengle & Goharian, 2008b) assigns a high score to a term if it appears consistently in only one specific category. The AM for a term t_k with respect to category c_i is calculated using Equation 2.…”
Section: Methods
confidence: 99%
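Equation 2 itself is not reproduced in the snippet. Reconstructed from Mengle and Goharian's definition as summarized above (a reconstruction, not a quotation), the measure is the fraction of a term's occurrences that fall in the given category:

```latex
% Reconstruction of the AM definition (Mengle & Goharian, 2008b);
% tf(t_k, c_i) is the frequency of term t_k in category c_i.
AM(t_k, c_i) = \frac{tf(t_k, c_i)}{tf(t_k)}, \qquad
AM(t_k) = \max_{c_i} AM(t_k, c_i)
```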