2023
DOI: 10.1145/3578553

KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language

Abstract: The need for Question Answering datasets in low resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts of Swahili, a low resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language for tasks such as internet search and dialog systems. Machine learn…

Cited by 6 publications (7 citation statements)
References 12 publications
“…The dataset was created as a collection of magazine PDF editions manually cleaned and reviewed. Work has been carried out to create the Kencorpus dataset [35] which is a Kenyan language corpus for Swahili, Dholuo, and Luhya languages. The Kencorpus contains a subset of Dholuo to Swahili translations and Luhya to Swahili translations.…”
Section: Text Datasets
Citation type: mentioning
confidence: 99%
“…The Kencorpus [35], which contains 177 hours of speech data for Swahili, Dholuo, and Luhya languages, was collected using voice recorders and the rest was obtained through collaborating with media houses.…”
Section: Speech Datasets
Citation type: mentioning
confidence: 99%
“…In some cases, the source of data is exam questions and the student's answers assigned to them, official documents containing FAQs, various quizzes, etc. The QA datasets for Bulgarian [41], Portuguese [42], Turkish [43], or Kenyan Swahili [44] were created in this way.…”
Section: Monolingual Question Answering Datasets
Citation type: mentioning
confidence: 99%
“…Datasets for QA and Information Retrieval tasks have also been created. They are, however, very few and cater to individual languages (Abedissa et al, 2023; Wanjawa et al, 2023) or a small subset of languages spoken in individual countries (Daniel et al, 2019; Zhang et al, 2022). Given the region's large number of linguistically diverse and information-scarce languages, multilingual and cross-lingual datasets are encouraged to catalyze research efforts.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Open Retrieval? # Languages # African Languages
XQA (Liu et al, 2019) ✓ ✓ ✓ 9 Nil
XOR QA (Asai et al, 2021) ✓ ✓ ✓ 7 Nil
XQuAD (Artetxe et al, 2020) ✓ ✗ ✗ 11 Nil
MLQA ✓ ✗ ✗ 7 Nil
MKQA (Longpre et al, 2021) ✓ ✗ ✓ 26 Nil
TyDi QA (Clark et al, 2020) ✓ ✗ ✓ 11 1
AmQA (Abedissa et al, 2023) ✓ ✗ ✗ 1 1
KenSwQuAD (Wanjawa et al, 2023) ✓ ✗ ✗ 1 1
AFRIQA (Ours) ✓ ✓ ✓ 10 10
(see Table 3)…”
Section: Introduction
Citation type: mentioning
confidence: 99%