PersianQuAD: The Native Question Answering Dataset for the Persian Language

Kazemi, Arefeh; Mozafari, Jamshid; Nematbakhsh, Mohammad Ali

doi:10.1109/access.2022.3157289

Cited by 9 publications

(4 citation statements)

References 32 publications

(54 reference statements)


“…Manual creation of questions and marking of answers to relevant paragraphs was usually done by crowdsourcing by trained workers. To create answers to the questions, either own annotation tools were used, such as AddQA [45], PI-AFanno [46], SAJAD [47], or already existing web crowdsourcing platforms such as Amazon Mechanical Turk [28], Toloka AI [40], or Prolific [48].…”

Section: Monolingual Question Answering Datasets (mentioning)

confidence: 99%

Slovak Dataset for Multilingual Question Answering

et al. 2023


SK-QuAD is the first manually annotated dataset of questions and answers in Slovak. It consists of more than 91k factual questions and answers from various fields. Each question has an answer marked in the corresponding paragraph. It also contains negative examples in the form of "unanswered questions" and "plausible answers". The dataset is published free of charge for scientific use. We aim to contribute to the creation of Slovak or multilingual systems for generating an answer to a question in a natural language. The paper provides an overview of the existing datasets for question answering. It describes the annotation process and statistically analyzes the created content. The dataset expands the possibilities of training and evaluation of multilingual language models. Experiments show that the dataset achieves state-of-the-art results for Slovak and improves question answering for other languages in zero-shot learning. We compare the effect of machine-translated data with that of manually annotated data. Additional data improve the modeling for low-resourced languages.



“…There are Persian datasets for NLP tasks like question answering [12], [13], [14], language modeling [19], or sentiment analysis [20]. However, there is no Persian benchmark dataset for the NLU task.…”

Section: Description Of Persian Dataset (mentioning)

confidence: 99%

A Persian Benchmark for Joint Intent Detection and Slot Filling

Akbari, Karimi, Saeedi et al. 2023

Preprint


“…There are some machine reading comprehension datasets for Persian [66, 67]. We build PASD by using the PersianQuAD dataset [67].…”

Section: Dataset (mentioning)

confidence: 99%

“…There are some machine reading comprehension datasets for Persian [66, 67]. We build PASD by using the PersianQuAD dataset [67]. PersianQuAD is the first large-scale native machine reading comprehension dataset for question answering for the Persian language.…”

Section: Dataset (mentioning)

confidence: 99%

PerAnSel: A Novel Deep Neural Network-Based System for Persian Question Answering

Mozafari, Kazemi, Moradi et al. 2022

Computational Intelligence and Neuroscience

Self Cite


Question answering (QA) systems have attracted considerable attention in recent years. They receive the user’s questions in natural language and respond to them with precise answers. Most of the works on QA were initially proposed for the English language, but some research studies have recently been performed on non-English languages. Answer selection (AS) is a critical component in QA systems. To the best of our knowledge, there is no research on AS for the Persian language. Persian is a (1) free word order, (2) right-to-left, (3) morphologically rich, and (4) low-resource language. Deep learning (DL) techniques have shown promising accuracy in AS. Although DL performs very well on QA, it requires a considerable amount of annotated data for training. Many annotated datasets have been built for the AS task; most of them are exclusively in English. In order to address the need for a high-quality AS dataset in the Persian language, we present PASD, the first large-scale native AS dataset for the Persian language. To show the quality of PASD, we employed it to train state-of-the-art QA systems. We also present PerAnSel: a novel deep neural network-based system for Persian question answering. Since the Persian language is a free word-order language, in PerAnSel, we parallelize a sequential method and a transformer-based method to handle various orders in the Persian language. We then evaluate PerAnSel on three datasets: PASD, PerCQA, and WikiFA. The experimental results indicate strong performance on the Persian datasets, beating state-of-the-art answer selection methods by 10.66% on PASD, 8.42% on PerCQA, and 3.08% on WikiFA in terms of MRR.

