Stop Word Lists in Free Open-source Software Packages

Nothman, Joel; Qin, Hanmin; Yurchak, Roman

doi:10.18653/v1/w18-2502

Cited by 40 publications

(20 citation statements)

References 7 publications

“…Then we deleted all the attached external website addresses, hashtags (#hashtags), mentions (@mentions), emojis, Arabic numbers and stopwords (e.g., prepositions, pronouns etc.), because such information is considered less meaningful in computational text analysis [38]. In addition, all the capital letters were converted to lower case (to standardize all the words) and we normalized the text with lemmatization (which refers to group together the inflected forms of a word) before the data are ready for the LDA model analyses.…”

Section: Methods (mentioning)

confidence: 99%

Analyzing Spanish News Frames on Twitter during COVID-19—A Network Study of El País and El Mundo

Justicia

2020

IJERPH

While COVID-19 is becoming one of the most severe public health crises in the twenty-first century, media coverage about this pandemic is getting more important than ever to make people informed. Drawing on data scraped from Twitter, this study aims to analyze and compare the news updates of two main Spanish newspapers El País and El Mundo during the pandemic. Throughout an automatic process of topic modeling and network analysis methods, this study identifies eight news frames for each newspaper’s Twitter account. Furthermore, the whole pandemic development process is split into three periods—the pre-crisis period, the lockdown period and the recovery period. The networks of the computed frames are visualized by these three segments. This paper contributes to the understanding of how Spanish news media cover public health crises on social media platforms.
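
The preprocessing listed in the citation statement above (removing URLs, hashtags, mentions, emojis, numerals and stop words, then lowercasing) could be sketched roughly as follows. This is only an illustration using Python's standard `re` module: the function name `clean_tweet`, the regular expressions and the tiny `STOP_WORDS` set are assumptions made here, not the cited study's code, and the lemmatization step is left out because it needs an external library.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only;
# the cited study does not say which list it used.
STOP_WORDS = {"a", "the", "of", "in", "and", "to", "is", "we", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # external website addresses
TAG_RE = re.compile(r"[#@]\w+")                # hashtags and mentions
DIGIT_RE = re.compile(r"\d+")                  # Arabic numerals
TOKEN_RE = re.compile(r"[^\W\d_]+")            # runs of Unicode letters only


def clean_tweet(text, stop_words=STOP_WORDS):
    """Return lowercased letter-only tokens with stop words removed."""
    # Strip URLs first, since they may themselves contain '#' or '@'.
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = text.lower()
    # Keeping only letter runs also drops emojis and punctuation.
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    print(clean_tweet("The lockdown in Madrid 2020 😷 #COVID19 @elpais https://example.com/x"))
```

Which words end up in the result depends entirely on the stop list passed in, so the list is kept as an explicit parameter rather than hidden inside the function.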

“…Initially, the words in the report are tokenized into a list of its constituent words. Punctuation and stop words are removed in this step as they are not useful for text analysis [28]. Stemming and lemmatization are also applied to the input to decrease the number of distinct words and consequently reduce the model's complexity.…”

Section: Data Preprocessing (mentioning)

confidence: 99%

Identifying Incident Causal Factors to Improve Aviation Transportation Safety: Proposing a Deep Learning Approach

Dong, Yang, Ebadi et al.

2021

Journal of Advanced Transportation

Aviation is a complicated transportation system, and safety is of paramount importance because aircraft failure often involves casualties. Prevention is clearly the best strategy for aviation transportation safety. Learning from past incident data to prevent potential accidents from happening has proved to be a successful approach. To prevent potential safety hazards and make effective prevention plans, aviation safety experts identify primary and contributing factors from incident reports. However, safety experts’ review processes have become prohibitively expensive nowadays. The number of incident reports is increasing rapidly due to the acceleration of advances in information technologies and the growth of the commercial and private aviation transportation industries. Consequently, advanced text mining algorithms should be applied to help aviation safety experts facilitate the process of incident data extraction. This paper focuses on constructing deep-learning-based models to identify causal factors from incident reports. First, we prepare the data sets used for training, validation, and testing with approximately 200,000 qualified incident reports from the Aviation Safety Reporting System (ASRS). Then, we take an open-source natural language model, which is well trained with a large corpus of Wikipedia texts, as the baseline and fine-tune it with the texts in incident reports to make it more suited to our specific research task. Finally, we build and train an attention-based long short-term memory (LSTM) model to identify primary and contributing factors in each incident report. The solution we propose has multilabel capability and is automated and customizable, and it is more accurate and adaptable than traditional machine learning methods in extant research. This novel application of deep learning algorithms to the incident reporting system can efficiently improve aviation safety.

“…The language dependence of the remaining algorithms can be compensated with a part-of-speech tagger and a list of known stop words for the corresponding language. Although stop lists are readily available, they should be selected with caution [96].…”

Section: Feature Engineering (mentioning)

confidence: 99%

Unifying Privacy Policy Detection

Hosseini, Degeling, Utz et al.

2021

Proceedings on Privacy Enhancing Technologies

Privacy policies have become a focal point of privacy research. With their goal to reflect the privacy practices of a website, service, or app, they are often the starting point for researchers who analyze the accuracy of claimed data practices, user understanding of practices, or control mechanisms for users. Due to vast differences in structure, presentation, and content, it is often challenging to extract privacy policies from online resources like websites for analysis. In the past, researchers have relied on scrapers tailored to the specific analysis or task, which complicates comparing results across different studies. To unify future research in this field, we developed a toolchain to process website privacy policies and prepare them for research purposes. The core part of this chain is a detector module for English and German, using natural language processing and machine learning to automatically determine whether given texts are privacy or cookie policies. We leverage multiple existing data sets to refine our approach, evaluate it on a recently published longitudinal corpus, and show that it contains a number of misclassified documents. We believe that unifying data preparation for the analysis of privacy policies can help make different studies more comparable and is a step towards more thorough analyses. In addition, we provide insights into common pitfalls that may lead to invalid analyses.
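
The caution urged in the citation statement above, that stop lists should be selected carefully, can be made concrete by checking which tokens two candidate lists treat differently. The sketch below is a hedged illustration in Python: both lists (`LIST_A`, `LIST_B`) and the helper names `removed_by` and `disagreement` are invented for this example and do not correspond to any particular package's stop list.

```python
def removed_by(tokens, stop_list):
    """Return the tokens (in order) that the given stop list would filter out."""
    return [t for t in tokens if t in stop_list]


def disagreement(tokens, list_a, list_b):
    """Return, sorted, the tokens removed by exactly one of the two lists."""
    a = set(removed_by(tokens, list_a))
    b = set(removed_by(tokens, list_b))
    return sorted(a ^ b)  # symmetric difference


# Two hypothetical stop lists, invented for illustration only.
LIST_A = frozenset({"the", "is", "of", "and", "to"})
LIST_B = LIST_A | {"system", "first"}

if __name__ == "__main__":
    tokens = "the privacy policy of the system is stored first".split()
    print(disagreement(tokens, LIST_A, LIST_B))
```

Running such a comparison on a sample of the target corpus before fixing a list shows whether content-bearing words would be dropped silently.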
