The need for Question Answering datasets in low-resource languages motivated this research, leading to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). The dataset is annotated from raw story texts in Swahili, a low-resource language predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold-standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality-assurance review of 12.5% of the annotated texts confirmed that the QA pairs were correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
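As a minimal sketch of what such a gold-standard QA resource looks like, the snippet below builds one record in a SQuAD-style structure commonly used for extractive QA datasets. The story text, identifiers, and QA pair are invented placeholders for illustration, not actual entries from KenSwQuAD, and the field names are an assumption rather than the dataset's published schema.

```python
# Illustrative sketch only: a SQuAD-style record such as those used to
# train and evaluate extractive QA systems. All values are hypothetical.
import json

sample = {
    "story_id": "swa_001",  # hypothetical identifier
    "text": "Hapo zamani za kale, palikuwa na mfalme aliyeishi mlimani.",
    "qa_pairs": [
        {
            "question": "Palikuwa na nani hapo zamani za kale?",
            "answer": "mfalme",  # answer is a span of the story text
        },
    ],
}

# A QA model is trained to map (text, question) -> answer span,
# so each annotated answer should occur verbatim in the story.
record = json.loads(json.dumps(sample))  # round-trip as it would be stored
assert record["qa_pairs"][0]["answer"] in record["text"]
```

Keeping answers as literal spans of the source text is what lets standard span-extraction metrics (exact match, token-level F1) be computed directly against the annotations.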
This paper elucidates the InterlinguaPlus design and its application to bi-directional text translation between the Ekegusii and Kiswahili languages, in contrast to traditional one-by-one translation pairs: either language can serve as the source or the target. The first section gives an overview of the project, followed by a brief review of Machine Translation. The next section discusses the implementation of the system using Carabao's open machine translation framework and the results obtained. So far, the translation results have been plausible, particularly for these resource-scarce local languages, and clearly affirm the morphological similarities inherent in Bantu languages.
This study examines the problem of hate speech identification in codeswitched social media text using a natural language processing approach. It explores different features in training nine models and empirically evaluates their predictiveness in identifying hate speech in a ~50k human-annotated dataset. The study takes a novel hierarchical approach that employs Latent Dirichlet Allocation to generate topic models, which help build a high-level psychosocial feature set that we abbreviate as PDC. PDC groups words of similar meaning into word families, which is significant in capturing codeswitching during the preprocessing stage for supervised learning models. The high-level PDC features are based on a hate speech annotation framework [1] that is largely informed by the duplex theory of hate [2]. Results from frequency-based models using the PDC feature on a dataset comprising tweets generated during the 2012 and 2017 presidential elections in Kenya indicate an F-score of 83% (precision: 81%, recall: 85%) in identifying hate speech. The study is significant in that, first, it publicly shares a unique codeswitched hate speech dataset that is valuable for comparative studies. Second, it provides a methodology for building a novel PDC feature set to identify nuanced forms of hate speech, camouflaged in codeswitched data, which conventional methods could not adequately identify.