2022
DOI: 10.7717/peerj-cs.1169

Identification of offensive language in Urdu using semantic and embedding models

Abstract: Automatic identification of offensive/abusive language is very necessary to get rid of unwanted behavior. However, it is more challenging to generalize the solution due to the different grammatical structures and vocabulary of each language. Most of the prior work targeted western languages, however, one study targeted a low-resource language (Urdu). The prior study used basic linguistic features and a small dataset. This study designed a new dataset (collected from popular Pakistani Facebook pages) containing…

Cited by 10 publications (5 citation statements)
References 35 publications (44 reference statements)

“…In addition, it can generate a vector of a specific length for each word by taking a sentence as input. Word2vec has demonstrated significant performance in similar NLP tasks (Ali & Malik, 2023; Hussain, Malik & Masood, 2022; Younas, Malik & Ignatov, 2023). The skip-gram and continuous bag of words (CBOW) are the two algorithms supported by the word2vec model to generate word embeddings.…”
Section: Framework Methodology (mentioning)
confidence: 99%
“…Word2vec word embedding model has shown state-of-the-art performance in many classification tasks related to the NLP domain [36][37][38]. There are two methods supported by word2vec to generate word embeddings; skip-gram and CBOW.…”
Section: Word2vec (mentioning)
confidence: 99%
“…Several machine learning techniques have been used to detect the offensive language in Urdu, as Hussain, Malik & Masood (2022) used embedding to train the classifier, Humayoun (2022) used feature combination, Ali et al. (2022) used transfer learning, and Das, Banerjee & Mukherjee (2022) used data bootstrapping. Table 1 shows machine learning (ML) and deep learning (DL) techniques for classification, datasets used for the model, pre-processing technique and feature selection (i.e., term frequency-inverse document frequency (TFIDF), n-gram, lexicon or word embeddings).…”
Section: Literature Review (mentioning)
confidence: 99%
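
The citation statements above name skip-gram and CBOW as the two word2vec training schemes. As a rough illustration only (not code from the cited papers), the sketch below shows how the two schemes frame the same sentence as training examples: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The function names, the `window` parameter, and the toy tokens are assumptions made for this example; no neural network is trained here.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram framing: one (center, context_word) example per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for ctx in context_window(tokens, i, window):
            pairs.append((center, ctx))
    return pairs


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW framing: one (context_words, center) example per position."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = context_window(tokens, i, window)
        if ctx:  # skip positions with no context (single-token input)
            pairs.append((ctx, center))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy sentence, already tokenized.
    sentence = ["a", "b", "c"]
    print(skipgram_pairs(sentence, window=1))
    print(cbow_pairs(sentence, window=1))
```

Skip-gram yields more, smaller training examples per sentence, whereas CBOW yields one example per position with the context pooled together; which framing suits a given corpus is an empirical question that this sketch does not settle.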