AlexU-BackTranslation-TL at SemEval-2020 Task 12: Improving Offensive Language Detection Using Data Augmentation and Transfer Learning

Ibrahim, Marc; Torki, Marwan; El-Makky, Nagwa M.

doi:10.18653/v1/2020.semeval-1.248

Cited by 10 publications

(5 citation statements)

References 19 publications


“…Overall translation performance of NMT systems for user reviews of IMDb movies and Amazon products was explored in Lohar, Popović, and Way (2019) and Popović et al. (2021). As for hate-speech detection, (Ibrahim, Torki, and El-Makky, 2020) used MT in order to balance the distribution of classes in training data. Existing English tweets were machine-translated into Portuguese (shown to be the best option), and then, these translations were translated back into English.…”

Section: MT for User-generated Content (mentioning)

confidence: 99%

Leveraging machine translation for cross-lingual fine-grained cyberbullying classification amongst pre-adolescents

Verma, Popović, Poulis, et al. 2022

Nat. Lang. Eng.


Cyberbullying is the wilful and repeated infliction of harm on an individual using the Internet and digital technologies. Similar to face-to-face bullying, cyberbullying can be captured formally using the Routine Activities Model (RAM), whereby the potential victim and bully are brought into proximity of one another via interaction on online social networking (OSN) platforms. Although the impact of the COVID-19 (SARS-CoV-2) restrictions on the online presence of minors has yet to be fully grasped, studies have reported that 44% of pre-adolescents have encountered more cyberbullying incidents during the COVID-19 lockdown. Transparency reports shared by OSN companies indicate increased take-downs of cyberbullying-related comments, posts or content by artificially intelligent moderation tools. However, to detect efficiently and effectively whether a social media post or comment qualifies as cyberbullying, a number of factors based on the RAM must be taken into account, including the identification of cyberbullying roles and forms. This demands the acquisition of large amounts of fine-grained annotated data, which is costly and ethically challenging to produce. In addition, where fine-grained datasets do exist, they may be unavailable in the target language. Manual translation is costly; however, state-of-the-art neural machine translation offers a workaround. This study presents a first-of-its-kind experiment in leveraging machine translation to automatically translate a unique pre-adolescent cyberbullying gold standard dataset in Italian with fine-grained annotations into English for training and testing a native binary classifier for pre-adolescent cyberbullying. In addition to contributing a high-quality English reference translation of the source gold standard, our experiments indicate that the performance of our target binary classifier when trained on machine-translated English output is on par with the source (Italian) classifier.


“…In addition to some trained machine translation models, Google's Cloud Translation API service is a common tool for back-translation widely applied by some works like [7,19,59,42,60,61,10,62,63]. Some works add additional features based on vanilla back-translation.…”

Section: Machine Translation (mentioning)

confidence: 99%

“…Method | Text classification | Text generation | Structure prediction
Paraphrasing
Thesauruses | [5], [93], [49], [7], [42], [60], [44], [45], [98] | - | [42], [43]
Embeddings | [8], [49] | - | -
MLMs | [10], [51], [54] | [55] | -
Rules | [10], [7], [11] | - | [99]
MT | [42], [60], [10], [12], [59], [61], [63], [7], [19], [66], [100], [98] | [13], [58] | [42], [57], [15]
Seq2Seq | [18], [68], [101] | [18], [102] | [18], [16], [67], [17], [103], [82]
Noising
Swapping | [93], [60], [44], [61], [20], [19] | -…”

Section: Text (mentioning)

confidence: 99%

Data Augmentation Approaches in Natural Language Processing: A Survey

Li, Hou, Che 2021

Preprint


As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision and was then introduced to natural language processing, where it achieves improvements in many tasks. One of the main focuses of DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data: paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges.


“…Back-translation usually yields a sentence with the same meaning but alternative words and sometimes a different sentence structure. The authors of [ITE20] applied back-translation to balance the classes in the OLID dataset using the Google Translation API. However, they encountered issues with sentences generated by translating through Arabic: the back-translated words were not offensive, which effectively changed the sentence class. The technique succeeded when back-translating from many other languages, such as Spanish, German, Portuguese and Italian. This is because the quality of translation differs between languages for a given model.…”

Section: Data Augmentation (mentioning)

confidence: 99%

Neural Models for Offensive Language Detection

Hamdy 2021

Preprint


Offensive language detection is an ever-growing natural language processing (NLP) application. This growth is mainly due to the widespread usage of social networks, which have become a mainstream channel for people to communicate, work, and enjoy entertainment content. Many incidents of sharing aggressive and offensive content have negatively impacted society to a great extent. We believe that improving and comparing different machine learning models to fight such harmful content is an important and challenging goal for this thesis. We targeted the problem of building efficient automated models for offensive language detection. Recent NLP advances, specifically the Transformer model, have tackled many shortcomings of standard seq-to-seq techniques. The BERT model has shown state-of-the-art results on many NLP tasks, although the literature is still exploring the reasons for BERT's achievements in the NLP field. Other efficient variants have been developed to improve upon the standard BERT, such as RoBERTa and ALBERT. Moreover, because the multilingual nature of text on social media could affect the model's decision on a given tweet, it is becoming essential to examine multilingual models such as XLM-RoBERTa, trained on 100 languages, and how they compare to unilingual models. The RoBERTa-based model proved to be the most capable and achieved the highest F1 score for the tasks. Another critical aspect of a well-rounded offensive language detection system is the speed at which a model can be trained and make inferences. In that respect, we have considered the model run-time and fine-tuned BlazingText, a very efficient implementation of FastText, which achieved good results and is much faster than BERT-based models.
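
The citation statements above describe back-translation as a way to balance classes: minority-class English texts are translated into a pivot language and back, and the resulting paraphrases are added with the original label. The following is a minimal sketch of that idea, not the cited authors' implementation. The function names, the "OFF"/"NOT" labels, the sample texts, and in particular `toy_translate` (a hypothetical dictionary-based stand-in for a real machine translation system) are assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A translator takes (text, source_lang, target_lang) and returns translated text.
Translator = Callable[[str, str, str], str]
Sample = Tuple[str, str]  # (text, label)


def back_translate(text: str, translate: Translator,
                   pivot: str = "pt", source: str = "en") -> str:
    """Translate source -> pivot -> source to obtain a paraphrase."""
    return translate(translate(text, source, pivot), pivot, source)


def balance_with_back_translation(samples: List[Sample], translate: Translator,
                                  minority_label: str,
                                  pivots: Sequence[str] = ("pt",)) -> List[Sample]:
    """Add back-translated minority samples until classes match or pivots run out."""
    counts = Counter(label for _, label in samples)
    deficit = max(counts.values()) - counts[minority_label]
    minority_texts = [text for text, label in samples if label == minority_label]
    augmented = list(samples)
    seen = {text for text, _ in samples}
    for pivot in pivots:
        for text in minority_texts:
            if deficit <= 0:
                return augmented
            paraphrase = back_translate(text, translate, pivot)
            # Skip paraphrases identical to existing texts; they add no diversity.
            if paraphrase not in seen:
                seen.add(paraphrase)
                augmented.append((paraphrase, minority_label))
                deficit -= 1
    return augmented


def toy_translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for an MT system: word-level dictionary lookup."""
    table = {
        ("en", "pt"): {"awful": "horrível"},
        ("pt", "en"): {"horrível": "horrible"},
    }
    mapping = table.get((source, target), {})
    return " ".join(mapping.get(word, word) for word in text.split())


data = [
    ("have a nice day", "NOT"),
    ("see you soon", "NOT"),
    ("thanks a lot", "NOT"),
    ("you are awful", "OFF"),
]
balanced = balance_with_back_translation(data, toy_translate, minority_label="OFF")
```

In practice `toy_translate` would be replaced by a call to a real translation model or service. As the statements above note, a paraphrase can lose its offensive wording for some pivot languages, so a real pipeline would also need to check that the augmented text still deserves its label.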
