Proceedings of Workshop for NLP Open Source Software (NLP-OSS) 2018
DOI: 10.18653/v1/w18-2502

Stop Word Lists in Free Open-source Software Packages

Abstract: Open-source software (OSS) packages for natural language processing often include stop word lists. Users may apply them without awareness of their surprising omissions (e.g. hasn't but not hadn't) and inclusions (e.g. computer), or their incompatibility with particular tokenizers. Motivated by issues raised about the scikit-learn stop list, we investigate variation among and consistency within 52 popular English-language stop lists, and propose strategies for mitigating these issues.
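
The tokenizer incompatibility mentioned in the abstract can be illustrated with a minimal sketch. The stop list below is a hypothetical four-word list made up for illustration, and the helper names `tokenize`, `unreachable_stop_words`, and `remove_stop_words` are assumptions, not part of any package. The token pattern mirrors the default documented for scikit-learn's `CountVectorizer`, which keeps only runs of two or more word characters and therefore splits contractions at the apostrophe.

```python
import re

# Hypothetical mini stop list, for illustration only (not taken from any package).
STOP_WORDS = {"the", "is", "not", "hasn't"}

# Keeps runs of two or more word characters; this mirrors the default
# token pattern documented for scikit-learn's CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens with TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())


def unreachable_stop_words(stop_words):
    """Return stop words that the tokenizer never emits as a single token."""
    return sorted(w for w in stop_words if tokenize(w) != [w])


def remove_stop_words(text, stop_words):
    """Tokenize the text and drop tokens that appear in the stop list."""
    return [t for t in tokenize(text) if t not in stop_words]


if __name__ == "__main__":
    # "hasn't" is tokenized as "hasn", so the stop list entry never matches.
    print(unreachable_stop_words(STOP_WORDS))
    print(remove_stop_words("The report hasn't changed", STOP_WORDS))
```

Under these assumptions, the entry `hasn't` can never match, because the tokenizer emits `hasn` instead, and that fragment survives filtering. Running a check of this kind against the tokenizer actually in use is one way to spot such mismatches before applying a stop list.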

citations
Cited by 40 publications
(20 citation statements)
references
References 7 publications
“…Then we deleted all the attached external website addresses, hashtags (#hashtags), mentions (@mentions), emojis, Arabic numbers and stopwords (e.g., prepositions, pronouns etc.), because such information is considered less meaningful in computational text analysis [38]. In addition, all the capital letters were converted to lower case (to standardize all the words) and we normalized the text with lemmatization (which refers to group together the inflected forms of a word) before the data are ready for the LDA model analyses.…”
Section: Methods (mentioning)
confidence: 99%
“…Initially, the words in the report are tokenized into a list of its constituent words. Punctuation and stop words are removed in this step as they are not useful for text analysis [28]. Stemming and lemmatization are also applied to the input to decrease the number of distinct words and consequently reduce the model's complexity.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
“…The language dependence of the remaining algorithms can be compensated with a part-of-speech tagger and a list of known stop words for the corresponding language. Although stop lists are readily available, they should be selected with caution [96].…”
Section: Feature Engineering (mentioning)
confidence: 99%