Naive Bayes Approach for Website Classification

Rajalakshmi, R.; Aravindan, Chandrabose

doi:10.1007/978-3-642-20573-6_55

Cited by 29 publications

(16 citation statements)

References 3 publications


“…In this study, a Quick Reduct algorithm was used for dimensionality reduction and information gain was used for feature selection. The study concluded that this approach would improve the accuracy and efficiency of the classifier. Rajalakshmi & Aravindan [14] performed a study that used only the URL of web page as the feature. This has a great advantage because the contents of a web page need not be fetched.…”

Section: B Feature Selection and Feature Extraction Techniques (mentioning)

confidence: 99%

Automatic Web Page Categorization Using Machine Learning and Educational-Based Corpus

Woogue, Pineda, Maderazo

2017

IJCTE


Abstract: The Internet is a powerful instrument that contains hundreds to thousands of resources. There is a need to categorize these resources based on certain categories in order to organize the contents of the Web better. This research aims to build a corpus that would be representative of pre-defined educational categories. This study will experiment on seven different algorithms that will be able to categorize web pages based on educational domain. Many studies about web categorization have already been conducted but are based on a general set of categories. This research will focus primarily on a predefined set of categories that are closely related to educational domains. With the use of machine learning, the classifier will be able to analyze what a web page is all about and determine its category. The study will also compare the different classifiers used. As a result, the system will be able to assign a web page to a particular educational domain and can be used by schools to determine the categories of web pages frequently requested by students. Linear SVM was also able to build a lexicon for the different categories. The top words for each category were then determined using this lexicon. Index Terms: corpus, decision trees, k-nearest neighbor, linear support vector machine, logistic regression, machine learning, multinomial naïve Bayes, multilayer perceptron, natural language processing, web page categorization.


“…They reported an F1 measure of 0.525, by applying Maximum Entropy as their classifier on WebKB dataset. For classifying URLs, an n-gram based approach is followed by Rajalakshmi and Aravindan (2011) in which only 3-grams derived from URLs are used as the features. In this approach, the dimensionality of feature vector is restricted to a maximum of 26³ features.…”

Section: Web Page Classification (mentioning)

confidence: 99%
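
To make the 26³ bound in the statement above concrete: with a 26-letter alphabet there are at most 26 × 26 × 26 = 17,576 distinct letter 3-grams, so the feature vector has a fixed maximum size regardless of how many URLs are seen. The following is a minimal sketch of such a feature extractor. The tokenization (lowercasing, splitting on non-letters so that 3-grams never span separators) and the names `url_trigrams` and `trigram_index` are assumptions made for illustration; the citation statement does not describe the original paper's exact preprocessing.

```python
import re
from collections import Counter

ALPHABET_SIZE = 26
N = 3
MAX_FEATURES = ALPHABET_SIZE ** N  # 26^3 = 17576 possible letter 3-grams


def url_trigrams(url: str) -> Counter:
    """Count letter 3-grams in a URL.

    Assumption: the URL is lowercased and split on every non-letter
    character, so 3-grams are taken within letter runs only.
    """
    tokens = re.findall(r"[a-z]+", url.lower())
    grams = Counter()
    for tok in tokens:
        # Slide a window of length N over each token; short tokens yield nothing.
        for i in range(len(tok) - N + 1):
            grams[tok[i:i + N]] += 1
    return grams


def trigram_index(gram: str) -> int:
    """Map a letter 3-gram to a fixed slot in the range [0, 26^3)."""
    idx = 0
    for ch in gram:
        idx = idx * ALPHABET_SIZE + (ord(ch) - ord("a"))
    return idx


if __name__ == "__main__":
    feats = url_trigrams("http://www.cs.edu/ai")
    print(sorted(feats))
    print(MAX_FEATURES)
```

Because every 3-gram maps to a fixed index, the counts can be stored in a vector of length 17,576 without building a vocabulary first.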

“…URL classification problem is studied by many researchers (Kan, 2004; Kan and Thi, 2005; Baykan et al., 2011; Rajalakshmi and Aravindan, 2011; Singh et al., 2012) and various URL features are suggested in the literature. Kan and Thi (2005) suggested segmentation techniques for extracting features from URLs.…”

Section: Introduction (mentioning)

confidence: 99%


Supervised Term Weighting Methods for Url Classification

Rajalakshmi

2014

Journal of Computer Science

Self Cite


Many term weighting methods are suggested in the literature for Information Retrieval and Text Categorization. Term weighting, a part of the feature selection process, has not yet been explored for the URL classification problem. We classify a web page using its URL alone without fetching its content, and hence URL-based classification is faster than other methods. In this study, we investigate the use of term weighting methods for selecting relevant URL features and their impact on the performance of URL classification. We propose a New Relevance Factor (NRF) for the supervised term weighting method to compute the URL weights and perform multiclass classification of URLs using a Naive Bayes classifier. To evaluate the proposed method, we have conducted various experiments on the ODP dataset, and our experimental results show that the proposed supervised term weighting method based on NRF is suitable for URL classification. We have achieved an 11% improvement in terms of Precision over the existing binary classifier methods and a 22% improvement in terms of F1 when compared with existing multiclass classifiers.


“…Then, the multiclass decision is made by combining the decisions of all the binary classifiers, and the ties are broken at random when one or more binary classifiers say “yes” or none of the classifiers say “yes.” Therefore, it resulted in a poor multiclass performance. In our previous works and the work reported by Baykan et al., only URL features were used for classification, but no feature selection method was applied. In their research work, they suggested the use of all n-grams (n = 4 to 8) for a URL classification problem.…”

Section: Introduction (mentioning)

confidence: 99%

A Naive Bayes approach for URL classification with supervised feature selection and rejection framework

Rajalakshmi, Aravindan

2018

Computational Intelligence

Self Cite


Web page classification has become a challenging task due to the exponential growth of the World Wide Web. Uniform Resource Locator (URL)-based web page classification systems play an important role, but high accuracy may not be achievable as a URL contains minimal information. Nevertheless, URL-based classifiers along with a rejection framework can be used as a first-level filter in a multistage classifier, and a costlier feature extraction from contents may be done in later stages. However, noisy and irrelevant features present in URLs demand feature selection methods for URL classification. Therefore, we propose a supervised feature selection method by which relevant URL features are identified using statistical methods. We propose a new feature weighting method for a Naive Bayes classifier by embedding the term goodness obtained from the feature selection method. We also propose a rejection framework for the Naive Bayes classifier by using posterior probability for determining the confidence score. The proposed method is evaluated on the Open Directory Project and WebKB data sets. Experimental results show that our method can be an effective first-level filter. McNemar tests confirm that our approach significantly improves the performance.
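
The rejection idea in this abstract (use the Naive Bayes posterior probability as a confidence score and decline to classify when it is low) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: it uses a plain multinomial Naive Bayes with Laplace smoothing, omits the paper's feature selection and feature weighting, and the class name `RejectingNaiveBayes`, the `threshold` of 0.8, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class RejectingNaiveBayes:
    """Multinomial Naive Bayes that returns None when its confidence is low."""

    def __init__(self, threshold=0.8, alpha=1.0):
        self.threshold = threshold  # hypothetical cut-off on the top posterior
        self.alpha = alpha          # Laplace smoothing constant

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, y in zip(docs, labels):
            self.feat_counts[y].update(feats)
            self.vocab.update(feats)
        return self

    def posteriors(self, feats):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        log_scores = {}
        for c, n_c in self.class_counts.items():
            total = sum(self.feat_counts[c].values())
            score = math.log(n_c / n_docs)  # log prior
            for f in feats:
                # Smoothed log likelihood of feature f under class c.
                score += math.log((self.feat_counts[c][f] + self.alpha)
                                  / (total + self.alpha * v))
            log_scores[c] = score
        # Normalize in a numerically stable way to get posterior probabilities.
        m = max(log_scores.values())
        exp = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}

    def predict(self, feats):
        post = self.posteriors(feats)
        label, p = max(post.items(), key=lambda kv: kv[1])
        # Reject (return None) so a later, costlier stage can decide.
        return label if p >= self.threshold else None


if __name__ == "__main__":
    docs = [["edu", "uni"], ["edu", "cs"], ["sho", "buy"], ["buy", "car"]]
    labels = ["edu", "edu", "shop", "shop"]
    clf = RejectingNaiveBayes().fit(docs, labels)
    print(clf.predict(["edu", "uni"]))  # confident case
    print(clf.predict(["edu", "buy"]))  # ambiguous case
```

In a multistage setup of the kind the abstract describes, a `None` result would hand the page to a later stage that fetches and analyzes its contents.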

