Effective processing of unstructured data using python in Hadoop map reduce

Kousalya, K.; Parvez, Shaik Javed

doi:10.14419/ijet.v7i2.21.12456

Cited by 1 publication

(1 citation statement)

References 2 publications

“…Research related to the use of map-reduce for preprocessing has been carried out to review the algorithmic aspects of parallel processing [25], Scalable Distributed Data Processing [26]-[28], to Effective processing for unstructured data using python [29]. The proposed research uses the python programming language and parallel processing; however, it uses a different kind of pre-processing and algorithm.…”

Section: Introduction (mentioning)

confidence: 99%

Text Classification Using Genetic Programming with Implementation of Map Reduce and Scraping

Wedashwara, Irmawati, Wijayanto, et al. (2023)

JOIV: Int. J. Inform. Visualization

Classification of text documents in online media is a big data problem that requires automation. Text classification accuracy can decrease when many terms are ambiguous between classes. Hadoop MapReduce is a parallel processing framework that has been widely used for text processing on big data. The study presents text classification using genetic programming, with text pre-processed using Hadoop MapReduce and data collected through web scraping. Genetic programming is used to perform association rule mining (ARM) before text classification in order to analyze big data patterns. The data used are articles from ScienceDirect retrieved with three keywords. The study aims to perform text classification with ARM-based data pattern analysis, supported by a data collection system based on web scraping, pre-processing using MapReduce, and text classification using genetic programming. Through web scraping, data were collected while reducing duplicates by as many as 17718. MapReduce performed tokenization and stop-word removal, yielding 36639 terms, of which 5189 are unique terms and 31450 are common terms. Evaluation of ARM with different amounts of multi-tree data shows that it can produce more and longer rules with better support. The multi-tree also produces more specific rules and better ARM performance than a single tree. Text classification evaluation shows that a single tree produces better accuracy (0.7042) than a decision tree (0.6892), and the lowest is a multi-tree (0.6754). The evaluation also shows that the ARM results are not in line with the classification results: for ARM, the multi-tree shows the best result (0.3904), followed by the decision tree (0.3588), and the lowest is a single tree (0.356).