2021
DOI: 10.1007/s11063-021-10528-4

ParsBERT: Transformer-based Model for Persian Language Understanding

Abstract: The surge of pre-trained language models has begun a new era in the field of Natural Language Processing (NLP) by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which shows its stat…

Cited by 104 publications (65 citation statements)
References 21 publications
“…We design and implement three versions of a deep-learning based QA system and deploy PersianQuAD as the training set of the QA systems. In line with the state-of-the-art research on QA tasks [36], we used three pre-trained language models in our QA systems: MBERT [37], ParsBERT [38] and ALBERT-FA [39]. MBERT (Multilingual Bidirectional Encoder Representations from Transformers) is a deep bidirectional language model developed by Google.…”
Section: A. Methods
Mentioning confidence: 99%
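As a point of reference, the sketch below shows how such a setup can be assembled with the Hugging Face transformers library. It is a minimal sketch, not the cited authors' code: the Hub checkpoint identifiers for ParsBERT and ALBERT-FA are assumptions, while the mBERT identifier is the standard Google release.

```python
# Hedged sketch only: loads the three pre-trained encoders named in the quoted
# statement with extractive-QA (span start/end) heads. Checkpoint ids marked
# "assumed" are illustrative guesses, not taken from the cited paper.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",          # Google's multilingual BERT
    "ParsBERT": "HooshvareLab/bert-fa-base-uncased",   # assumed ParsBERT checkpoint
    "ALBERT-FA": "m3hrdadfi/albert-fa-base-v2",        # assumed Persian ALBERT checkpoint
}

def load_qa_model(name):
    """Return (tokenizer, model) where the model adds a span start/end head."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
    return tokenizer, model

tokenizer, model = load_qa_model("ParsBERT")
```

The QA head is randomly initialized at this point; it only becomes useful after fine-tuning on a dataset such as PersianQuAD.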
“…As regards the text encoder, we carried out the experiments with three different multilingual models, i.e., the multilingual version of BERT (Devlin et al., 2019) (mBERT) and the base and large versions of XLM-RoBERTa (Conneau et al., 2020) (XLMR-base and XLMR-large, respectively). In the monolingual setting, we used the following language-specific models: BERT-de, CamemBERT-large (Martin et al., 2020), BERT-it, and ParsBERT (Farahani et al., 2020), respectively, for German, French, Italian, and Farsi. As for all the other languages covered by the WordNet datasets, i.e., Bulgarian, Chinese, Croatian, Danish, Dutch, Estonian, Japanese and Korean, we used the pre-trained models made available by TurkuNLP.…”
Section: Methods
Mentioning confidence: 99%
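The per-language encoder choice described in this statement (a language-specific model where one exists, another pre-trained model otherwise) can be sketched as follows. All checkpoint identifiers are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged sketch of the encoder-selection logic described above.
# Checkpoint ids are illustrative assumptions only.
MONOLINGUAL = {
    "de": "bert-base-german-cased",             # BERT-de (assumed)
    "fr": "camembert/camembert-large",          # CamemBERT-large
    "it": "dbmdz/bert-base-italian-cased",      # BERT-it (assumed)
    "fa": "HooshvareLab/bert-fa-base-uncased",  # ParsBERT (assumed)
}

def pick_encoder(lang, fallback="xlm-roberta-base"):
    """Prefer a language-specific checkpoint; otherwise use a fallback
    (the quoted paper used TurkuNLP models for the remaining languages)."""
    return MONOLINGUAL.get(lang, fallback)

print(pick_encoder("fa"))  # -> ParsBERT checkpoint
print(pick_encoder("ja"))  # -> fallback model
```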
“…We have evaluated PQuAD using two pre-trained transformer-based language models, namely ParsBERT (Farahani et al, 2021) and XLM-RoBERTa (Conneau et al, 2020), as well as BiDAF (Levy et al, 2017) which is an attention-based model proposed for MRC.…”
Section: Models
Mentioning confidence: 99%
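For context, the sketch below shows how an extractive-QA answer span is read off the start/end logits of a transformer model such as ParsBERT or XLM-RoBERTa (BiDAF uses a different, attention-flow architecture and is not shown). This is not the PQuAD evaluation code: the checkpoint id and the tiny Persian example are assumptions, and a checkpoint actually fine-tuned on PQuAD would be needed for meaningful answers.

```python
# Hedged sketch of extractive-QA inference with a BERT-style model.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

ckpt = "HooshvareLab/bert-fa-base-uncased"  # assumed ParsBERT checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)  # QA head untrained here

question = "پایتخت ایران کجاست؟"       # "What is the capital of Iran?"
context = "پایتخت ایران تهران است."    # "The capital of Iran is Tehran."
inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# The predicted span is the argmax of the start and end logits over input tokens
# (a fuller decoder would also enforce start <= end and a maximum span length).
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```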