KenSwQuAD—A Question Answering Dataset for Swahili Low-resource Language

Wanjawa, Barack Wamkaya; Wanzare, Lilian D. A.; Indede, Florence; McOnyango, Owen; Muchemi, Lawrence; Ombui, Edward

doi:10.1145/3578553

Cited by 6 publications

(7 citation statements)

References 12 publications

“…The dataset was created as a collection of magazine PDF editions manually cleaned and reviewed. Work has been carried out to create the Kencorpus dataset [35] which is a Kenyan language corpus for Swahili, Dholuo, and Luhya languages. The Kencorpus contains a subset of Dholuo to Swahili translations and Luhya to Swahili translations.…”

Section: Text Datasets (mentioning)

confidence: 99%

Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Nakatumba‐Nabende, Babirye, Nabende et al. 2024

Applied AI Letters

Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high‐quality natural language processing resources for low‐resourced African languages. Obtaining high‐quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore‐Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.

“…The Kencorpus [35], which contains 177 hours of speech data for Swahili, Dholuo, and Luhya languages, was collected using voice recorders and the rest was obtained through collaborating with media houses.…”

Section: Speech Datasets (mentioning)

confidence: 99%

Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Nakatumba‐Nabende, Babirye, Nabende et al. 2024

Applied AI Letters

“…In some cases, the source of data is exam questions and the student's answers assigned to them, official documents containing FAQs, various quizzes, etc. The QA datasets for Bulgarian [41], Portuguese [42], Turkish [43], or Kenyan Swahili [44] were created in this way.…”

Section: Monolingual Question Answering Datasets (mentioning)

confidence: 99%

Slovak Dataset for Multilingual Question Answering

et al. 2023

SK-QuAD is the first manually annotated dataset of questions and answers in Slovak. It consists of more than 91k factual questions and answers from various fields. Each question has an answer marked in the corresponding paragraph. It also contains negative examples in the form of "unanswered questions" and "plausible answers". The dataset is published free of charge for scientific use. We aim to contribute to the creation of Slovak or multilingual systems for generating an answer to a question in a natural language. The paper provides an overview of the existing datasets for question answering. It describes the annotation process and statistically analyzes the created content. The dataset expands the possibilities of training and evaluation of multilingual language models. Experiments show that the dataset achieves state-of-the-art results for Slovak and improves question answering for other languages in zero-shot learning. We compare the effect of machine-translated data with that of manually annotated data. Additional data improve the modeling for low-resourced languages.

“…Datasets for QA and Information Retrieval tasks have also been created. They are, however, very few and cater to individual languages (Abedissa et al, 2023; Wanjawa et al, 2023) or a small subset of languages spoken in individual countries (Daniel et al, 2019; Zhang et al, 2022). Given the region's large number of linguistically diverse and information-scarce languages, multilingual and cross-lingual datasets are encouraged to catalyze research efforts.…”

Section: Related Work (mentioning)

confidence: 99%

“… | Open Retrieval? | # Languages | # African Languages
XQA (Liu et al, 2019) | ✓ | ✓ | ✓ | 9 | Nil
XOR QA (Asai et al, 2021) | ✓ | ✓ | ✓ | 7 | Nil
XQuAD (Artetxe et al, 2020) | ✓ | ✗ | ✗ | 11 | Nil
MLQA | ✓ | ✗ | ✗ | 7 | Nil
MKQA (Longpre et al, 2021) | ✓ | ✗ | ✓ | 26 | Nil
TyDi QA (Clark et al, 2020) | ✓ | ✗ | ✓ | 11 | 1
AmQA (Abedissa et al, 2023) | ✓ | ✗ | ✗ | 1 | 1
KenSwQuAD (Wanjawa et al, 2023) | ✓ | ✗ | ✗ | 1 | 1
AFRIQA (Ours) | ✓ | ✓ | ✓ | 10 | 10 (see Table 3)…”

Section: Introduction (mentioning)

confidence: 99%

Cross-lingual Open-Retrieval Question Answering for African Languages

Ogundepo, Gwadabe, Rivera et al. 2023

Findings of the Association for Computational Linguistics: EMNLP 2023

African languages have far less in-language content available digitally, making it challenging for question-answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) …lation and multilingual retrieval methods. Overall, AFRIQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
