Legal Framework, Dataset and Annotation Schema for Socially Unacceptable Online Discourse Practices in Slovene

Fišer, Darja; Erjavec, Tomaž; Ljubešić, Nikola

doi:10.18653/v1/w17-3007

Cited by 65 publications

(52 citation statements)

References 4 publications

“…For annotating the datasets, we used a two-dimensional annotation schema an early version of which was presented in [2], covering both the type of potentially socially unacceptable discourse and the target this discourse is aimed at. The annotation was performed in PyBossa, a web-based crowdsourcing tool.…”

Section: Dataset Annotation (mentioning)

confidence: 99%

“…The annotation schemas used in these datasets are very different, ranging from encoding multiple toxicity levels, covert vs. overt aggressiveness, the target of the inappropriateness only etc. The first two pieces of work to take into account both the type of SUD and its target are the annotation schema presented in [2] (which is used in the dataset presented in this paper) and the OLID dataset [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…Each comment is annotated with a two-dimensional annotation schema for SUD, covering both the type and the target of SUD. The main contributions of this paper are the following: (1) we offer a selection of Facebook pages aimed at representativeness and comparability for a specific country / language, (2) we apply an identical formalism on comparable data in two languages, making this the first multilingual dataset annotated for SUD we are aware of, (3) we annotate for a very broad phenomenon of SUD, covering most phenomena various datasets cover in isolation, (4) we annotate full discussion (comment) threads, not isolated short utterances, ensuring both that (a) the annotators are as informed of the context as possible while making their decisions (e.g., annotating tweets in isolation, not knowing their context, is a questionable, but regular practice) and (b) that the context of the comment is available either for analyzing the dataset or using the dataset for (semi)automating the identification of SUD, and (5) we perform a first analysis of this rich dataset, observing interesting phenomena both across topics and across languages.…”

Section: Introduction (mentioning)

confidence: 99%

The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

Ljubešić

Fišer

Erjavec

2019

Text, Speech, and Dialogue

Self Cite

Abstract. In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK, which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD). The main advantages of these datasets compared to the existing ones are identical sampling procedures, producing comparable data across languages and an annotation schema that takes into account six types of SUD and five targets at which SUD is directed. We describe the sampling and annotation procedures, and analyze the annotation distributions and inter-annotator agreements. We consider this dataset to be an important milestone in understanding and combating SUD for both languages.

“…They have used two classes: Hate and Non-hate. [16] proposed hate speech classification task Slovene language at multiple granularities. At a coarse level, they have identified two classes SUD (Socially Unacceptable Online Discourse), and not SUD.…”

Section: State Of the Art (mentioning)

confidence: 99%

Tracking Hate in Social Media: Evaluation, Challenges and Approaches

et al. 2020

This paper presents online hate speech as a societal and computational challenge. Offensive content detection in social media is considered as a multilingual, multi-level, multi-class classification problem for three Indo-European languages. This research problem is offered to the community through the HASOC shared task. HASOC intends to stimulate research and development in hate speech recognition across different languages. Three datasets (in English, German, and Hindi) were developed from Twitter and Facebook, and made available. This paper describes the creation of the multilingual datasets and the annotation method. We will present the numerous approaches based on traditional classifiers, deep neural models, and transfer learning models, along with features used for the classification. Results show that the best classifier for the binary classification might not perform best in the multi-class classification, and the performance of the same classifier varies across the languages. Overall, transfer learning models such as BERT, and deep neural models based on LSTMs and CNNs perform similar but better than traditional classifiers such as SVM. We will conclude the discussion with a list of issues that needs to be addressed for future datasets.

“…Mubarak et al (2017) addresses abusive language detection on Arabic social media and Su et al (2017) presents a system to detect and rephrase profanity in Chinese. Hate speech and abusive language datasets have been recently annotated for German (Ross et al, 2016) and Slovene (Fišer et al, 2017) opening avenues for future work in languages other than English.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Hate Speech in Social Media

Malmasi

Zampieri

2017

RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning

In this paper we examine methods to detect hate speech in social media, while distinguishing this from general profanity. We aim to establish lexical baselines for this task by applying supervised classification methods using a recently released dataset annotated for this purpose. As features, our system uses character n-grams, word n-grams and word skip-grams. We obtain results of 78% accuracy in identifying posts across three classes. Results demonstrate that the main challenge lies in discriminating profanity and hate speech from each other. A number of directions for future work are discussed.
