2020
DOI: 10.1371/journal.pone.0232525
The influence of preprocessing on text classification using a bag-of-words representation

Abstract: Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications. Many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of common misspelled words, and reduction of replicated characters. We hypothesize that the application…
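The preprocessing steps enumerated in the abstract can be sketched as follows. The stopword list, regular expressions, and function name here are simplified illustrative stand-ins, not the paper's actual pipeline:

```python
import re
import string

def preprocess(text, remove_stopwords=True):
    """Sketch of common text-preprocessing steps from the abstract:
    HTML tag removal, lowercasing, punctuation removal, reduction of
    replicated characters, and optional stopword removal."""
    stopwords = {"the", "a", "an", "is", "of", "and", "to", "in"}  # toy list
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tag removal
    text = text.lower()                        # uppercase -> lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooo" -> "soo" (replicated characters)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("<b>Soooo good!!!</b> The movie is GREAT and fun."))
# -> ['soo', 'good', 'movie', 'great', 'fun']
```

The `remove_stopwords` flag reflects the paper's central question: whether applying or skipping a given preprocessing step helps or hurts downstream classification.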

Cited by 162 publications (103 citation statements)
References 34 publications
“…In this context, the authors in [40] observed a decrease in the performance of SVM classification models from 70.76% to 55.26% for the task of automatic annotation of clinical text fragments based on codebooks with a large number of categories. Similarly, the authors in [41], [42] also reported underperformance of the employed classification models for text classification as a consequence of removing stopwords. In the case of the DL models, we used BiLSTM layers, which handle long-term dependencies and can store information for a long duration.…”
Section: Stopwords and Their Impact in Text Preprocessing
confidence: 88%
“…Consequently, the tweets need to be preprocessed before analysis so that all irrelevant attributes are removed from the datasets to avoid contradictory results. In this research, we preprocessed all the datasets uniformly at multiple stages, as described in the literature (HaCohen-Kerner, Miller & Yigal, 2020), and obtained improved results. Text preprocessing includes data cleansing by removing unrelated data, including URLs, stop words, smilies, slang, redundant data, and all other irrelevant material.…”
Section: Data Scrubbing and Transformation
confidence: 99%
“…To model a mass spectrum using LLDA, it is necessary to represent a mass spectrum as a bag-of-words "document" [23]. First, any fragment having a mass-to-charge ratio (m/z) below 30 is discarded to remove structurally uninformative fragments.…”
Section: Data Preprocessing
confidence: 99%
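The m/z filtering and bag-of-words tokenization described in the excerpt above can be sketched as follows. The peak list, the `mz_` token format, and the function name are hypothetical illustrations, not the cited work's actual encoding:

```python
# Hypothetical peak list: (m/z, intensity) pairs, purely for illustration.
spectrum = [(18.0, 120.0), (29.0, 45.0), (43.0, 900.0), (57.0, 300.0), (91.0, 650.0)]

def spectrum_to_words(peaks, min_mz=30.0):
    """Discard structurally uninformative fragments below min_mz, then turn
    the remaining peaks into bag-of-words tokens (here: integer m/z bins)."""
    kept = [(mz, inten) for mz, inten in peaks if mz >= min_mz]
    return [f"mz_{round(mz)}" for mz, _ in kept]

print(spectrum_to_words(spectrum))
# -> ['mz_43', 'mz_57', 'mz_91']
```

Binning m/z values into discrete "words" is one simple way to make a continuous spectrum compatible with bag-of-words topic models such as LLDA.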