A Hierarchically-Labeled Portuguese Hate Speech Dataset

Fortuna, Paula; Silva, João Rocha da; Soler-Company, Juan; Wanner, Leo; Nunes, Sérgio

doi:10.18653/v1/w19-3510

Cited by 109 publications

(107 citation statements)

References 23 publications

Supporting

Mentioning

102

Contrasting

Unclassified


“…Finally, regarding hate speech recognition amongst tweets in Portuguese, our results for the MLP (micro-averaged F1 = 0.85) outscored those by [Fortuna et al 2019], who report micro-averaged F1 = 0.72 with LSTM. At this point, it is worth stressing the fact that both models were run in the same corpus, as mentioned in Section 3, thereby reducing the influence of external variables, such as data source and language.…”

Section: Discussion and Model Comparison (supporting)

confidence: 55%
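
The statement above compares models by micro-averaged F1. As a minimal sketch of how that metric is computed (the function name and the toy labels are illustrative assumptions, not taken from either paper), true positives, false positives and false negatives are pooled over all classes before a single F1 is calculated:

```python
def micro_f1(y_true, y_pred, labels=None):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1          # correctly predicted this class
            elif pred == label and truth != label:
                fp += 1          # predicted this class, but it was another
            elif truth == label and pred != label:
                fn += 1          # missed an instance of this class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical labels (not data from the cited corpus).
gold = ["hate", "not", "not", "hate"]
pred = ["hate", "not", "hate", "hate"]
score = micro_f1(gold, pred)
```

When every example carries exactly one label and all classes are included, the pooled counts make micro-averaged F1 equal to plain accuracy, which is worth keeping in mind when reading the 0.85 versus 0.72 comparison.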

“…In this work we relied on a dataset of tweets in Portuguese [Fortuna et al 2019], collected through Twitter's API from January to March 2017. To build the dataset, tweets were fetched using specific keywords, and then filtered so as to come from user accounts known to produce hate speech material (i.e.…”

Section: Methods (mentioning)

confidence: 99%

“…During preprocessing, we followed [Fortuna et al 2019] and removed stop words and punctuation marks using the NLTK (Natural Language Toolkit) library and the python punctuation package, respectively. Text representations were built under the Bag of Words (BOW) [Fan et al 2008] and N-Gram [Collobert and Weston 2009] paradigms.…”

Section: Methods (mentioning)

confidence: 99%
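
The preprocessing described in the statement above (stop-word and punctuation removal, then Bag of Words and N-Gram features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the citing authors' code: the stop-word set is a tiny hypothetical stand-in for the NLTK list mentioned in the quote, and the function names are invented for this example.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word set used only for illustration;
# the quoted statement says the stop words came from NLTK.
STOP_WORDS = {"a", "o", "de", "e", "que", "em", "um", "uma"}


def preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_features(text, max_n=2):
    """Count unigrams up to max_n-grams: a bag-of-words / n-gram representation."""
    tokens = preprocess(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = bag_of_features("O gato, e o cão!")
```

Here `features` holds counts for the unigrams `gato` and `cão` plus the bigram `gato cão`. Note that `string.punctuation` covers only ASCII punctuation, so a real pipeline for tweets would likely need extra handling for emoji, mentions and URLs.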

“…Finally, regarding hate speech identification in Portuguese, in [Fortuna et al 2019] the authors apply an LSTM, with pre-trained word embeddings, to detect hate speech in a database of labeled tweets ('hate' vs. 'not hate speech'), obtaining a micro-averaged F1 score of 0.72. Since their corpus of labeled tweets is freely available for download, we have chosen it as a test bed for our models, also using their LSTM as the main benchmark for the classifiers tested in this research.…”

Section: Related Work (mentioning)

confidence: 99%


Hate Speech Detection in Portuguese with Naïve Bayes, SVM, MLP and Logistic Regression

Silva, Roman

2020

Anais Do Encontro Nacional De Inteligência Artificial E Computacional (ENIAC 2020)


Even though social networks can provide free space for discussing ideas, people can also use them to propagate hate speech and, given the amount of written material in such networks, it becomes necessary to rely on automatic methods for identifying this problem. In this work, we set out to verify the use of some classic Machine Learning algorithms for the task of hate speech detection in tweets written in Portuguese, by testing four different models (SVM, MLP, Logistic Regression and Naïve Bayes) with different configurations. Results show that these algorithms produce better results (in terms of micro-averaged F1 score) than the LSTM used as a benchmark, and are also competitive with other results in the related literature.


“…This would provide a common framework for researchers who want to investigate either the phenomenon at large or one of its many facets. This direction is explored, for example, in a recent work by Fortuna et al (2019). Another major issue are biases in the design and annotation of corpora.…”

Section: Lexical Analysis (mentioning)

confidence: 99%

Resources and benchmark corpora for hate speech detection: a systematic review

Poletto, Basile, Sanguinetti, et al. 2020

Lang Resources & Evaluation


Hate Speech in social media is a complex phenomenon, whose detection has recently gained significant traction in the Natural Language Processing community, as attested by several recent review works. Annotated corpora and benchmarks are key resources, considering the vast number of supervised approaches that have been proposed. Lexica play an important role as well for the development of hate speech detection systems. In this review, we systematically analyze the resources made available by the community at large, including their development methodology, topical focus, language coverage, and other factors. The results of our analysis highlight a heterogeneous, growing landscape, marked by several issues and venues for improvement.
