A Novel Hybrid Feature Extraction Technique and Spam Review Detection using Ensemble Machine Learning Algorithm by Web Scrapping

Goyal, Navin Kumar; Pal, Arup Kumar; Keswani, Bright; Goyal, Dinesh; Gupta, Mukesh Kumar

doi:10.17485/ijst/v16i29.1500

Cited by 1 publication

(1 citation statement)

References 13 publications


“…Goyal et al used three supervised machine learning methods: Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes, and Bernoulli Naïve Bayes to detect fake reviews. The GNB classifier outperforms other models in terms of accuracy and F1-score metric, as well as identifying deceptive reviews (7). The authors used the NLTK library to clean up the review data, which is a predefined functionality, so the pre-processing methodology is not novel.…”

Section: Introduction (mentioning)

confidence: 94%

Enhancing Spam Email Classification Using Effective Preprocessing Strategies and Optimal Machine Learning Algorithms

Ghogare, Dawoodi, Patil. 2024. IJST


Objective: This article proposes a content-based spam email classification by applying various text pre-processing techniques. NLP techniques have been applied to pre-process the content of an email to get the optimal performance of spam email classification using machine learning.

Method: Several combinations of pre-processing methods, such as stopping, removing tags, converting to lower case, removing punctuation, removing special characters, and natural language processing, were applied to the extracted content from the email with machine learning algorithms like NB, SVM, and RF to classify an email as ham or spam. The standard datasets like Enron and SpamAssassin, along with the personal email dataset collected from Yahoo Mail, are used to evaluate the performance of the models.

Findings: Applying stemming in pre-processing to the RF classifier yielded the best results, achieving 99.2% accuracy on the SpamAssassin dataset and 99.3% accuracy on the Enron dataset. Lemmatization followed closely with 99% accuracy. In real-world testing on a personal Yahoo email dataset, the proposed method significantly improved accuracy from 89.82% to 97.28% compared to the email service provider's built-in classifier. Additionally, the study found that SVM performs accurately when stop words are retained.

Novelty: This article introduces a unique perspective by highlighting the fine-tuning of pre-processing techniques. The focus is on removing tags and certain special characters, while retaining those that improve spam email classification accuracy. Unlike prior works that primarily emphasize algorithmic approaches and pre-defined processing functions, our research delves into the intricacies of data preparation, showcasing its significant impact on spam email classifiers. These findings emphasize the crucial role of pre-processing and contribute to a more nuanced understanding of effective strategies for robust spam detection.
Keywords: Spam, Classification, Pre-processing, NLP, Machine Learning
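
The idea of trying "several combinations of pre-processing methods" described in the abstract can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names (`preprocess`, `crude_stem`, `all_configurations`), the tiny stop-word list, and the suffix-stripping stemmer are all hypothetical stand-ins. A real experiment would presumably use a proper stemmer or lemmatizer and feed each variant to the NB, SVM, and RF classifiers.

```python
import re
import string
from itertools import product

# Tiny illustrative stop-word list (an assumption; real lists are far larger).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "you", "your"}

# Matches anything that looks like an HTML/XML tag.
TAG_RE = re.compile(r"<[^>]+>")

# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, strip_tags=True, lowercase=True, strip_punct=True,
               drop_stop_words=True, stem=True):
    """Apply the selected pre-processing steps and return a token list."""
    if strip_tags:
        text = TAG_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(PUNCT_TABLE)
    tokens = text.split()
    if drop_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def all_configurations():
    """Yield every on/off combination of the five steps as keyword arguments."""
    names = ("strip_tags", "lowercase", "strip_punct", "drop_stop_words", "stem")
    for flags in product((False, True), repeat=len(names)):
        yield dict(zip(names, flags))
```

Each dictionary yielded by `all_configurations()` can be passed as `preprocess(email_body, **config)`, so that every pre-processing variant is evaluated against the same classifier and dataset; retaining stop words, for example, is just `drop_stop_words=False`.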
