UrduThreat@FIRE2021: Shared Track on Abusive Threat Identification in Urdu

Amjad, Maaz; Zhila, Alisa; Sidorov, Grigori; Labunets, Andrey; Butt, Sabur; Amjad, Hamza Imam; Vitman, Oxana; Gelbukh, Alexander

doi:10.1145/3503162.3505241

Cited by 9 publications

(5 citation statements)

References 2 publications


“…For the purpose of this study, two cross-platform datasets have been gathered for detecting offensive language, one from Twitter platform ( Amjad et al, 2022 ) (referred to as D 1 ) and the other from YouTube (referred to as D 2 ). The type of text on both platforms is different so these datasets are ideal for answering the research question discussed in the introduction.…”

Section: Methods (mentioning)

confidence: 99%

“…As in Akhter et al (2021) researchers have used machine learning and deep learning techniques to understand which technique performed well on Roman Urdu and Nastaliq Urdu scripts. In order to improve collaboration and contribution in the field of Urdu abusive language detection, a competition was arranged to come up with novel ways to detect the abusive language in Urdu Nastaliq script ( Amjad et al, 2022 ). Urdu, a language spoken mainly in Pakistan and India, is considered a low-resource language in terms of natural language processing (NLP) research.…”

Section: Literature Review (mentioning)

confidence: 99%

“…It is context-sensitive but there is no capitalization. The Urdu language is an important research language in South Asia ( Daud, Khan & Che, 2017 ) with more than 230 million speakers worldwide ( Amjad et al, 2022 ). Due to complex morphology, grammatical restriction, and low availability of resources, automatic detection of abusive language detection is a layered and complex machine learning task.…”

Section: Literature Review (mentioning)

confidence: 99%


Detection of offensive terms in resource-poor language using machine learning algorithms

Raza, Mahoto, Hamdi et al. 2023

PeerJ Computer Science


The use of offensive terms in user-generated content is a major concern for social media platforms. Offensive terms have a negative impact on individuals and may degrade societal and civil norms. The immense amount of content, generated at high speed, makes it humanly impossible to detect and categorise offensive terms manually, and detecting them automatically remains an open challenge for natural language processing (NLP). Substantial efforts have been made for high-resource languages such as English. The task is more challenging for resource-poor languages such as Urdu because of the lack of standard datasets and pre-processing tools for automatic offensive-term detection. This paper introduces a combinatorial pre-processing approach for developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from the two platforms for training and testing the model, which is trained with decision tree, random forest and naive Bayes algorithms. The combinatorial pre-processing approach is applied to check how machine learning models behave with different combinations of standard pre-processing techniques for a low-resource language in the cross-platform setting. The experimental results show the effectiveness of the machine learning model over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive-term detection for a low-resource language, i.e., Urdu, in the cross-platform scenario. When dataset D1 is used for training and D2 for testing, stopword removal produced better results, with an accuracy of 83.27%. When D2 is used for training and D1 for testing, stopword removal together with punctuation removal was observed to be the better pre-processing approach, with an accuracy of 74.54%. The combinatorial approach proposed in this paper outperformed the benchmark for the considered datasets using classical as well as ensemble machine learning, with an accuracy of 82.9% and 97.2% for datasets D1 and D2, respectively.
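The combinatorial pre-processing idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the step names, the tiny English stopword list, and the helper functions `step_combinations` and `apply_steps` are all hypothetical and only show how every subset of a few standard cleaning steps might be enumerated before training and testing a classifier on each variant.

```python
from itertools import combinations
import string

# Hypothetical stand-in stopword list; a real Urdu list would be needed in practice.
STOPWORDS = {"the", "a", "is"}

def remove_stopwords(text):
    """Drop whitespace-separated tokens that appear in STOPWORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punctuation(text):
    """Strip ASCII punctuation only; Urdu marks such as the Arabic comma
    would have to be added to the table for real data (assumption)."""
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_digits(text):
    """Drop every character that Python classifies as a digit."""
    return "".join(ch for ch in text if not ch.isdigit())

# Illustrative set of standard pre-processing steps (names are assumptions).
STEPS = {
    "stopword_removal": remove_stopwords,
    "punctuation_removal": remove_punctuation,
    "digit_removal": remove_digits,
}

def step_combinations(steps):
    """Yield every non-empty subset of step names, smallest subsets first."""
    names = list(steps)
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)

def apply_steps(text, combo, steps=STEPS):
    """Apply the chosen steps in order, then collapse repeated whitespace."""
    for name in combo:
        text = steps[name](text)
    return " ".join(text.split())

if __name__ == "__main__":
    sample = "The model, is 100 percent a test!"
    for combo in step_combinations(STEPS):
        print(combo, "->", apply_steps(sample, combo))
```

In a cross-platform experiment of the kind the abstract describes, each combination would be applied to both datasets, a model trained on one and evaluated on the other, and the accuracies compared across combinations.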


“…The shared tasks [19] present in this competition are divided into two parts. Where in one part participants have to focus on detecting Abusive language using twitter tweets in Urdu language (Subtask A) 13 and in other part mainly focusing on detecting Threatening language using Twitter tweets in Urdu language (Subtask B) 14 . The presented data has been collected and annotated from Natural Language and Text Processing Laboratory 15 at Center for Computing Research 16 of Instituto Politécnico Nacional, Mexico.…”

Section: Dataset Description (mentioning)

confidence: 99%

“…There is also one sub-task in HASOC 2020 9 which aimed to identify offensive post in code-mixed dataset. Extending that task further, the organisers of this shared task [13] have build two datasets of 3400, 9950 posts to detect abusive and threatening language in Urdu. Twitter's definition has been followed to describe , whether a post is abusive/non-abusive 10 , and threading/non-threatening 11 .…”

Section: Introduction (mentioning)

confidence: 99%

Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach

Das, Banerjee, Saha 2021

Preprint


Online hatred is a growing concern on many social media platforms. To address this issue, social media platforms have introduced moderation policies for such content. They also employ moderators who check posts that violate these policies and take appropriate action. Researchers in the abusive language domain have also carried out various studies to detect such content better. Although there is extensive research on abusive language detection in English, there is a lacuna in abusive language detection for low-resource languages such as Hindi and Urdu. In the FIRE 2021 shared task "HASOC - Abusive and Threatening language detection in Urdu", the organisers propose an abusive language detection dataset in Urdu along with a threatening language detection dataset. In this paper, we explore several machine learning models, such as XGBoost, LGBM, and m-BERT-based models, for abusive and threatening content detection in Urdu based on the shared task. We observed that a Transformer model specifically trained on an abusive language dataset in Arabic helps in getting the best performance. Our model placed first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively. We have made our code public.
