2021
DOI: 10.48550/arxiv.2102.11278
Preprint

RUBERT: A Bilingual Roman Urdu BERT Using Cross Lingual Transfer Learning

Usama Khalid, Mirza Omer Beg, Muhammad Umair Arshad

Abstract: Recent studies have shown that multilingual language models underperform their monolingual counterparts (Conneau et al., 2020). It is also well known that training and maintaining monolingual models for each language is a costly and time-consuming process. Roman Urdu is a resource-starved language used widely on social media platforms and chat apps. In this research we propose a novel dataset of scraped tweets containing 54M tokens and 3M sentences. Additionally, we propose RUBERT, a bi…
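The recipe the title and abstract describe is cross-lingual transfer: initialize from the released English BERT checkpoint and continue masked-language-model pretraining on a Roman Urdu corpus. Below is a minimal sketch of that idea, assuming the HuggingFace Transformers and Datasets libraries; the checkpoint name, corpus file, output path, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

# Sketch: continued MLM pretraining of English BERT on Roman Urdu text.
# ASSUMPTIONS: HuggingFace Transformers/Datasets are installed;
# "roman_urdu_tweets.txt" (one tweet per line) and all hyperparameters
# are hypothetical, not taken from the paper.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public English BERT weights rather than random init.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus file: one Roman Urdu sentence per line.
raw = load_dataset("text", data_files={"train": "roman_urdu_tweets.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="rubert-transfer",      # illustrative output path
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=collator,
).train()

Because the English tokenizer and embeddings are reused, transfer of this kind leans on shared subword units between English and romanized Urdu; that overlap is what makes continued pretraining cheaper than training from scratch.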

Cited by 2 publications (2 citation statements)
References 28 publications
“…The outcomes illustrated that the proposed model exhibited greater robustness compared to the baseline approaches. Moreover, the authors of [25] developed the RUBERT model by retraining the English BERT on Roman Urdu text. The study also involved building the BERT model exclusively for Roman Urdu text from scratch.…”
Section: Related Work
confidence: 99%
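The citing paper above also mentions a from-scratch variant built exclusively for Roman Urdu. A minimal sketch of that alternative, again assuming HuggingFace libraries: a new WordPiece vocabulary is learned from the Roman Urdu corpus and BERT starts from randomly initialized weights instead of the English checkpoint. File and directory names here are hypothetical.

# Sketch: Roman Urdu BERT trained from scratch (no English weights).
# ASSUMPTIONS: HuggingFace Tokenizers/Transformers; file names hypothetical.
from tokenizers import BertWordPieceTokenizer
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast

# Learn a WordPiece vocabulary directly from the Roman Urdu corpus.
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["roman_urdu_tweets.txt"], vocab_size=30_000)
wp.save_model("rubert-scratch")  # writes vocab.txt to this directory

tokenizer = BertTokenizerFast.from_pretrained("rubert-scratch")

# Randomly initialized BERT sized to the new vocabulary.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Pretraining then proceeds with the same MLM loop as in the earlier
# sketch, only starting from random weights rather than English BERT.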
“…The challenges of multilingual models are explained in detail in [41]. Even bilingual language modeling has been found to perform better than multilingual modeling [42]. A study shows that monolingual versions outperform traditional multilingual models on all datasets.…”
Section: Literature Review
confidence: 99%