Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) 2018
DOI: 10.18653/v1/w18-5108

Improving Moderation of Online Discussions via Interpretable Neural Models

Abstract: The growing volume of comments makes online discussions difficult to moderate by human moderators alone. Antisocial behavior is a common occurrence that often discourages other users from participating in the discussion. We propose a neural-network-based method that partially automates the moderation process. It consists of two steps. First, we detect inappropriate comments for moderators to see. Second, we highlight inappropriate parts within these comments to make the moderation faster. We evaluated our method on data…
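
The abstract describes a two-step pipeline: flag whole comments for review, then highlight the offending parts. The excerpt does not give the authors' architecture, so the following is only a minimal sketch under assumed choices (an average-pooled embedding classifier and gradient-based token saliency), not the model from the paper.

```python
# Minimal sketch of the two-step idea, NOT the authors' architecture:
# step 1 scores a whole comment, step 2 ranks tokens by gradient saliency
# so a moderator can see which words drove the "inappropriate" score.
import torch
import torch.nn as nn

class CommentClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, 1)

    def forward(self, token_ids):
        vecs = self.emb(token_ids)               # (seq_len, emb_dim)
        pooled = vecs.mean(dim=0)                # average-pool the tokens
        score = torch.sigmoid(self.out(pooled))  # P(comment is inappropriate)
        return score, vecs

def flag_and_highlight(model, token_ids, threshold=0.5):
    """Step 1: flag the comment; step 2: weight each token by the gradient
    norm of the inappropriateness score, as a crude highlight."""
    score, vecs = model(token_ids)
    vecs.retain_grad()
    score.backward()
    saliency = vecs.grad.norm(dim=1)             # one weight per token
    flagged = score.item() > threshold
    return flagged, saliency.tolist()
```

The second step is what speeds up moderation: instead of reading the whole comment, the moderator is pointed to the highlighted spans. The attribution used above (plain gradient norms) is just one stand-in for whatever interpretability mechanism the full text describes.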

Cited by 14 publications (5 citation statements)
References 21 publications (12 reference statements)

“…The practical tools for identifying and moderating hate speech, such as profanity filters, content moderation filters, or human driven approaches, have been widely studied [5,10,11,31,32]. In the context of social media platforms, the findings on utilization of practical tools are somewhat puzzling as they either show that social platforms perform too much or too little content moderation and lack transparent decision-making processes [33].…”
Section: Definitions of Hate Speech (mentioning)
confidence: 99%
“…A few works have specifically pursued the idea of interpretable ML for abuse detection: (Svec et al 2018) shows that an interpretable model can match human-generated annotations with high precision, while (Pavlopoulos, Malakasiotis, and Androutsopoulos 2017) proposes using explanations to help humans make decisions about borderline instances. (Wang 2018) analyzes pitfalls associated with using interpretable ML for abuse detection.…”
Section: Interpretable Machine Learning (mentioning)
confidence: 99%
“…Moderating classifiers' outcomes is also related to the field of explainable ML, in particular explaining individual classification outcomes (Ribeiro et al, 2016). Studies indicate that explaining relevant words of a class outcome support human annotation tasks by, e.g., reducing the annotation time needed per instance and increasing user trust (Švec et al, 2018;Ribeiro et al, 2016). Our approach is likely to benefit from explaining artificial decisionmaking as well as model uncertainties (Andersen et al, 2020) during the moderation process.…”
Section: Related Work (mentioning)
confidence: 99%
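
The last statement cites Ribeiro et al. (2016), i.e. LIME, as the kind of per-instance explanation that shortens annotation time. The sketch below illustrates that idea only; the toy classifier_fn, its keyword rule, and the example comment are placeholders, not anything taken from the cited papers.

```python
# Illustration of word-level explanations in the spirit of LIME
# (Ribeiro et al., 2016); the classifier here is a fake stand-in.
import numpy as np
from lime.lime_text import LimeTextExplainer

def classifier_fn(texts):
    # Placeholder for a trained model: returns [P(ok), P(inappropriate)] per text.
    return np.array([[0.1, 0.9] if "idiot" in t.lower() else [0.9, 0.1]
                     for t in texts])

explainer = LimeTextExplainer(class_names=["ok", "inappropriate"])
explanation = explainer.explain_instance(
    "You are an idiot, just stop posting.", classifier_fn, num_features=5)

# (word, weight) pairs a moderation interface could render as highlights.
print(explanation.as_list())
```

LIME fits a local linear surrogate around perturbed versions of the input, so the returned weights are only locally faithful; they serve as a decision aid for the moderator, not a guarantee about the model's reasoning.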