Hate Speech Annotation: Analysis of an Italian Twitter Corpus

Poletto, Fabio; Stranisci, Marco; Sanguinetti, Manuela; Patti, Viviana; Bosco, Cristina

doi:10.4000/books.aaccademia.2448

Cited by 51 publications

(57 citation statements)

References 1 publication

“…In this section we extend the preliminary qualitative analysis of the data presented in a previous study on the tag distribution (Poletto et al, 2017). Figure 1 sums up such distribution over the final version of our corpus.…”

Section: Results (mentioning)

confidence: 70%

“…We obtained a dataset of 236,193 tweets, from which we randomly selected a subset to be annotated. The detailed description of the entire pipeline of the data collection and annotation can be found in Poletto et al (2017). Given the higher degree of complexity that applying such scheme entailed, we first annotated 1,827 tweets, then we performed another data filtering starting from neutral words that more frequently occur in texts annotated as HS in this first dataset: invadere (invade), invasione (invasion), basta (enough), fuori (out), comunist* (communist*), african* (African), barcon* (migrants boat*).…”

Section: Corpus Creation and Description (mentioning)

confidence: 99%
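
The keyword-based filtering step quoted above can be illustrated with a minimal sketch. It assumes that a trailing `*` on a keyword stands for any word ending and that matching ignores case; the function names `build_pattern` and `filter_tweets` and the sample sentences are hypothetical, not taken from the cited pipeline.

```python
import re

# Keywords as listed in the citation statement; "*" is assumed to mark
# an open word ending (e.g. "barcon*" covers "barcone", "barconi").
KEYWORDS = ["invadere", "invasione", "basta", "fuori",
            "comunist*", "african*", "barcon*"]


def build_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word."""
    parts = []
    for kw in keywords:
        if kw.endswith("*"):
            # Stem followed by any run of word characters.
            parts.append(re.escape(kw[:-1]) + r"\w*")
        else:
            parts.append(re.escape(kw))
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(KEYWORDS)


def filter_tweets(tweets):
    """Keep only the tweets that contain at least one keyword."""
    return [t for t in tweets if PATTERN.search(t)]


if __name__ == "__main__":
    sample = [
        "Basta con questa pioggia",
        "Buongiorno a tutti",
        "Il barcone è arrivato in porto",
    ]
    for tweet in filter_tweets(sample):
        print(tweet)
```

A single compiled alternation keeps the filter to one pass per tweet; the actual selection procedure in the cited work may differ.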

“…Such categories include, besides HS, aggressiveness, offensiveness, irony and stereotype. After the first annotation phase, we measured the Inter-Annotator Agreement (also described in Poletto et al (2017)) and the results showed a high disagreement in all annotation categories (with a coefficient ranging from k=0.37 for offensiveness to k=0.54 for hate speech). In light of these results, we discussed the possible sources of disagreement, and revised the guidelines accordingly.…”

Section: Annotation Scheme: Tagset Design and Issues (mentioning)

confidence: 99%
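
The agreement coefficients quoted above (k=0.37 to k=0.54) are chance-corrected measures. The statement does not say which kappa variant was used, so the following is only a sketch of Cohen's kappa for two annotators, given as an assumption; the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["yes", "yes", "no", "no"]
    annotator_2 = ["yes", "no", "no", "no"]
    print(cohen_kappa(annotator_1, annotator_2))
```

With more than two annotators, a multi-rater coefficient such as Fleiss' kappa would be the usual choice instead.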

An Impossible Dialogue! Nominal Utterances and Populist Rhetoric in an Italian Twitter Corpus of Hate Speech against Immigrants

Comandini, Patti (2019)

Proceedings of the Third Workshop on Abusive Language Online

Self Cite

The paper describes a recently-created Twitter corpus of about 6,000 tweets, annotated for hate speech against immigrants, and developed to be a reference dataset for an automatic system of hate speech monitoring. The annotation scheme was therefore specifically designed to account for the multiplicity of factors that can contribute to the definition of a hate speech notion, and to offer a broader tagset capable of better representing all those factors, which may increase, or rather mitigate, the impact of the message. This resulted in a scheme that includes, besides hate speech, the following categories: aggressiveness, offensiveness, irony, stereotype, and (on an experimental basis) intensity. The paper hereby presented namely focuses on how this annotation scheme was designed and applied to the corpus. In particular, also comparing the annotation produced by CrowdFlower contributors and by expert annotators, we make some remarks about the value of the novel resource as gold standard, which stems from a preliminary qualitative analysis of the annotated data and on future corpus development.

“…The data are released after the annotation process, which involved non-trained contributors on the crowdsourcing platform Figure Eight (F8). The annotation scheme applied to the HatEval data is a simplified merge of schemes already applied in the development of corpora for HS detection and misogyny by the organizers (Fersini et al, 2018a,b;, also in the context of funded projects with focus on the tasks topics Poletto et al, 2017). It includes the following categories:…”

Section: Annotation (mentioning)

confidence: 99%

SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter

Basile, Bosco, Fersini et al. (2019)

Proceedings of the 13th International Workshop on Semantic Evaluation

Self Cite

537

557

The paper describes the organization of the SemEval 2019 Task 5 about the detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter. The task is organized in two related classification subtasks: a main binary subtask for detecting the presence of hate speech, and a finer-grained one devoted to identifying further features in hateful contents such as the aggressive attitude and the target harassed, to distinguish if the incitement is against an individual rather than a group. HatEval has been one of the most popular tasks in SemEval-2019 with a total of 108 submitted runs for Subtask A and 70 runs for Subtask B, from a total of 74 different teams. Data provided for the task are described by showing how they have been collected and annotated. Moreover, the paper provides an analysis and discussion about the participant systems and the results they achieved in both subtasks.

“…Several hate speech datasets are publicly available, e.g., for English (Waseem and Hovy, 2016; Davidson et al, 2017; Nobata et al, 2016; Jigsaw, 2018), Spanish (Fersini et al, 2018), Italian (Poletto et al, 2017; Sanguinetti et al, 2018), German (Ross et al, 2016), Hindi (Kumar et al, 2018), and Portuguese (de Pelle and Moreira, 2017). In this section, we analyze the data collection strategy, the annotation method and the dataset properties of three representative hate speech datasets: the Hate speech, Racism and Sexism dataset by Waseem and Hovy (2016), the Offensive Language Dataset by Davidson et al (2017), and the Portuguese News Comments dataset by de Pelle and Moreira (2017).…”

Section: Dataset Annotation (mentioning)

confidence: 99%

A Hierarchically-Labeled Portuguese Hate Speech Dataset

Fortuna, Silva, Soler-Company et al. (2019)

Proceedings of the Third Workshop on Abusive Language Online

100

Over the past years, the amount of online offensive speech has been growing steadily. To successfully cope with it, machine learning is applied. However, ML-based techniques require sufficiently large annotated datasets. In the last years, different datasets were published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been in focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. First, non-experts annotated the tweets with binary labels ('hate' vs. 'no-hate'). Then, expert annotators classified the tweets following a fine-grained hierarchical multiple label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. The hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried out a baseline classification experiment with pre-trained word embeddings and LSTM on the binary classified data, with a state-of-the-art outcome.
