2015 Resilience Week (RWS) 2015
DOI: 10.1109/rweek.2015.7287440

Optimal stop word selection for text mining in critical infrastructure domain

Abstract: Eliminating all stop words from the feature space is a standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may result in the loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain-specific stop words for improved text mining accuracy. First, the presented methodology retains all the stop words in the text preprocessing ph…

Cited by 15 publications (9 citation statements)
References 13 publications
“…-Elimination of stop words: The stop words are a set of words that provide little or no semantic meaning in the texts, they are generally the words that appear most frequently in a language and contain prepositions, pronouns, auxiliary verbs, etc. Eliminating stop words is a basic step in pre-processing to perform text mining, which, as the name suggests, consists of removing the stop words from the set of characteristics of the texts [1]. The catalog used contains 613 stop words in Spanish.…”
Section: Methods (mentioning)
confidence: 99%
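
As a minimal illustration of the stop word elimination step described in the statement above, the sketch below filters tokens against a stop word set. The short Spanish list and the function name `remove_stop_words` are assumptions made for illustration; the 613-word catalog mentioned in the quote is not reproduced here.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the quoted work uses a catalog of 613 Spanish stop words.
STOP_WORDS = {"el", "la", "los", "las", "de", "en", "y", "que", "un", "una"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(remove_stop_words("El gato duerme en la casa de un amigo"))
# prints ['gato', 'duerme', 'casa', 'amigo']
```

Removing every listed word unconditionally is the standard practice that the paper under discussion questions, since some of those words may carry domain-specific meaning.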
“…statistical, word distribution in documents using variance measure and using the entropy measure. An evolutionary technique was proposed by [9] to extract the optimal set of stop words from the critical infrastructure domain.…”
Section: Related Studies (mentioning)
confidence: 99%
“…Many articles on the bag-of-words method [1,4,7,18,23] show that an integral part of the algorithm is the processing of stop-words. In Amarasinghe, Manic and Hruska [23] this stage was given special attention. They emphasized that the removal of the words leads to the loss of some useful information.…”
Section: B Types Of Text Mining (mentioning)
confidence: 99%
“…Therefore, it was proposed that an alternate method, in which the stop-words are considered separately from the key words and the dimension is reduced using a genetic algorithm, be used instead. In [23] experiments were carried out that showed that the accuracy of the algorithm increased by two percent. However, their experiments have been conducted on a fairly small amount of data and there are questions as whether or not the proposed method is effective and, most importantly, can it quickly reduce the dimension of stop-words with a large amount of data?…”
Section: B Types Of Text Mining (mentioning)
confidence: 99%
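
The idea summarized in the statement above, treating stop words separately and using a genetic algorithm to decide which ones to drop, could be sketched roughly as follows. This is not the authors' implementation: the encoding, the operators, the parameter defaults, and the toy fitness function are all assumptions, since the page does not describe them. In a real setting the fitness would be something like the validation accuracy of the text mining algorithm.

```python
import random


def evolve_stop_words(candidates, fitness, generations=30, pop_size=20,
                      mutation_rate=0.1, seed=0):
    """Search for a subset of candidate stop words to remove.

    A chromosome is a list of 0/1 flags: 1 means "remove this word",
    0 means "keep it as a feature". `fitness` maps the set of removed
    words to a score (higher is better).
    """
    rng = random.Random(seed)
    n = len(candidates)

    def decode(chrom):
        return {w for w, bit in zip(candidates, chrom) if bit}

    def score(chrom):
        return fitness(decode(chrom))

    # Start from the standard practice (remove everything) plus random subsets.
    population = [[1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        elite = ranked[: max(2, pop_size // 4)]  # elitism: best survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randint(1, n - 1) if n > 1 else 0  # single-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = elite + children

    best = max(population, key=score)
    return decode(best), score(best)


# Toy fitness (an assumption standing in for classifier accuracy):
# pretend "not" and "off" carry domain meaning and should be kept,
# while every other candidate is noise that is better removed.
CANDIDATES = ["the", "a", "of", "not", "off", "is"]
INFORMATIVE = {"not", "off"}


def toy_fitness(removed):
    noise_removed = len(removed - INFORMATIVE)
    signal_lost = len(removed & INFORMATIVE)
    return noise_removed - 2 * signal_lost


removed, best = evolve_stop_words(CANDIDATES, toy_fitness)
print(sorted(removed), best)
```

Because the remove-everything baseline is seeded into the initial population and the elite is carried over unchanged, the returned subset never scores worse than the standard practice under the given fitness function. The scalability concern raised in the statement applies here too: each generation evaluates the fitness once per chromosome, which becomes costly when the fitness is a full training run on a large corpus.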