Microblog posts such as tweets frequently contain users’ opinions and thoughts about events, products, people, and institutions. However, social media is also commonly used to propagate hate speech. Analyzing hateful speech in social media is essential for understanding, fighting, and discouraging it. We believe that extracting fragments of text that are semantically similar makes it possible to depict recurrent linguistic patterns in certain kinds of discourse, and we aim to use these patterns to capture statements that are frequently expressed in microblog posts. In this paper, we propose to exploit such linguistic patterns in the context of hate speech. Through a technique that we call SSP (Short Semantic Pattern) mining, we extract sequences of words that share a similar meaning in their word embedding representation. By analyzing the extracted patterns, we reveal kinds of discourse that recur across a dataset, such as racist and sexist statements. We then experiment with SSPs as features for classifiers that detect whether a tweet contains hate speech (binary classification) and that distinguish between sexist, racist, and clean tweets (ternary classification). The SSP instances found in sexist tweets show that a large number of them begin with the introductions ‘I’m not sexist but’ and ‘Call me sexist but’. Meanwhile, SSP instances found in racist tweets reveal a prominence of content against the Islamic religion and associated entities and organizations.
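The idea of grouping word sequences by the similarity of their embeddings can be illustrated with a minimal Python sketch. This is not the SSP mining algorithm itself, whose details the abstract does not give. The toy two-dimensional vectors, the function names, the averaging of word vectors, and the similarity threshold are all assumptions made for illustration; a real system would presumably load pretrained word embeddings.

```python
import math
from itertools import combinations

# Hypothetical 2-d toy vectors, invented for illustration only.
# A real pipeline would load pretrained word embeddings instead.
TOY_EMBEDDINGS = {
    "i'm": (1.0, 0.0),
    "call": (0.9, 0.1),
    "me": (0.9, 0.2),
    "not": (0.8, 0.1),
    "sexist": (0.0, 1.0),
    "but": (0.5, 0.5),
    "good": (-1.0, 0.2),
    "morning": (-0.9, -0.3),
    "everyone": (-0.8, 0.1),
}


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def embed(ngram):
    """Average the word vectors of an n-gram (one simple composition choice)."""
    vectors = [TOY_EMBEDDINGS[word] for word in ngram]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_ngram_pairs(posts, n=3, threshold=0.98):
    """Pair up n-grams from different posts whose embeddings are close."""
    candidates = []
    for post_id, post in enumerate(posts):
        for gram in ngrams(post.lower().split(), n):
            # Skip n-grams containing words without a vector.
            if all(word in TOY_EMBEDDINGS for word in gram):
                candidates.append((post_id, gram, embed(gram)))
    pairs = []
    for (id_a, gram_a, vec_a), (id_b, gram_b, vec_b) in combinations(candidates, 2):
        if id_a != id_b and cosine(vec_a, vec_b) >= threshold:
            pairs.append((gram_a, gram_b))
    return pairs


posts = ["I'm not sexist but", "Call me sexist but", "good morning everyone"]
for gram_a, gram_b in similar_ngram_pairs(posts):
    print(" ".join(gram_a), "~", " ".join(gram_b))
```

With these toy vectors, the two sexism-related openings are paired with each other while the unrelated greeting is not. Averaging word vectors and applying a fixed cosine threshold are the simplest choices available here, not necessarily those used in the paper.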
Extracting accurate information from the enormous volumes of data generated on social media, much of it unstructured, is a major challenge today, but one with many relevant applications, many of them still latent. One of the first and most decisive steps in this information extraction process is recognizing relevant words in text. This article presents a comparative study of methods and tools for recognizing relevant words in microblog posts. Of the several tools analyzed, five were selected for experiments with 100 thousand tweets. These experiments showed high variability in the results of different tools, which suggests the need for improvement.