MasakhaNER: Named Entity Recognition for African Languages

Adelani, David Ifeoluwa; Abbott, Jade; Neubig, Graham; D'souza, Daniel; Kreutzer, Julia; Lignos, Constantine; Palen-Michel, Chester; Buzaaba, Happy; Rijhwani, Shruti; Ruder, Sebastian; Mayhew, Stephen D.; Azime, Israel Abebe; Muhammad, Shamsuddeen Hassan; Emezue, Chris Chinenye; Nakatumba‐Nabende, Joyce; Perez, Ogayo; Aremu, Anuoluwapo; Gitau, Catherine; Mbaye, Derguene; Alabi, Jesujoba O.; Yimam, Seid Muhie; Tajuddeen, Gwadabe; Ezeani, Ignatius; Niyongabo, Rubungo Andre; Mukiibi, Jonathan; Otiende, Verrah; Orife, Iroro; David, Davis; Ngom, Samba; Adewumi, Tosin; Rayson, Paul; Adeyemi, Mofetoluwa; Muriuki, Gerald; Anebi, Emmanuel; Chiamaka, Chukwuneke; Odu, Nkiruka; Wairagala, Eric Peter; Samuel, Oyerinde; Siro, Clemencia; Bateesa, Tobius Saul; Oloyede, Temilola; Wambui, Yvonne; Akinode, Victor; Nabagereka, Deborah; Katusiime, Maurice; Awokoya, Ayodele; Mboup, Mouhamadane; Gebreyohannes, Dibora; Tilaye, Henok; Nwaike, Kelechi; Wolde, Degaga; Faye, Abdoulaye; Ahia, Orevaoghene; Dossou, Bonaventure F. P.; Ogueji, Kelechi; Diop, Thierno Ibrahima; Diallo, Abdoulaye Baniré; Akinfaderin, Adewale; Marengereke, Tendai; Osei, Salomey

doi:10.1162/tacl_a_00416

Cited by 29 publications (42 citation statements)

References 33 publications (36 reference statements)

“…In particular, most NER efforts have focused on a few European and Asian languages, while African languages have been given little attention. Only seven studies of NER on Amharic have been found in the literature [49] [3] [9] [17] [69] [68] [1]. In these Amharic NER studies, two NER datasets compiled from different sub-sets of the Walta Information Center Corpus [16] are used.…”

Section: Related Work (mentioning)

confidence: 99%

“…In these Amharic NER studies, two NER datasets compiled from different sub-sets of the Walta Information Center Corpus [16] are used. In addition to the Walta Information Center corpus, there is also the Adelani [1] dataset and Sikdar and Gambäck [68] New Mexico State University Computing Research Laboratory dataset, which is annotated for the SAY project. The data is annotated with 6 classes (PER, LOC, ORG, TIME, TTL, and O-other) and it is available on GitHub¹.…”

Section: Related Work (mentioning)

confidence: 99%

ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer

Jibril; Tantuğ. 2023. IEEE Access.

Named Entity Recognition is an information extraction task that serves as a pre-processing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. Named entity recognition enables the identification of proper names as well as temporal and numeric expressions in an open domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the named entity recognition task is more challenging due to the heavily inflected structure of these languages. In this study, we annotate a new comparatively large Amharic named entity recognition dataset and make it publicly available. Using this new dataset, we build multiple Amharic named entity recognition systems based on recent deep learning approaches including transfer learning (RoBERTa), and bidirectional long short-term memory coupled with a conditional random fields layer. By applying the Synthetic Minority Over-sampling Technique to mitigate the imbalanced classification problem, our best performing RoBERTa based named entity recognition system achieves an f1-score of 93%, which is the new state-of-the-art result for Amharic named entity recognition.

“…The hyper-parameters for all models are reported in Table 3. Following [47], ADAMW was used as the optimizer function [72]. Additionally, we observed k-fold cross-validation aided in better performance during training.…”

Section: NER Model (mentioning)

confidence: 99%

“…Our NER models will be evaluated using Precision, Recall, and F1 scores on the train, as well as the test data [47].…”

Section: Model Evaluation (mentioning)

confidence: 99%

“…South African languages reserving a seat amongst the cohort unable to meet the aforementioned prerequisites [10,11,46]. For this, we aim to explore News Headlines Classification (NHC) [11] and Named Entity Recognition (NER) [47] downstream tasks on four agglutinative of the 11 official South African languages: Isixhosa (languages of the Nguni tribe), Sesotho, Setswana, and Sepedi (three languages of the Sotho-Tswana language family). The unique contributions of this study are organized into the following three paradigms:…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Shot Transfer Learning using Affix and Correlated Cross-Lingual Embeddings.

Modupe; Sindane; Marivate. 2023. Preprint.

Learning morphologically supplemented embedding spaces using cross-lingual models has become an active area of research and facilitated many research breakthroughs in various applications such as machine translation, named entity recognition, document classification, and natural language inference. However, the field has not become customary for Southern African low-resourced languages. In this paper, we present, evaluate and benchmark a cohort of cross-lingual embeddings for the English-Southern African languages on two classification tasks: News Headlines Classification (NHC) and Named Entity Recognition (NER). Our methodology considers four agglutinative languages from the eleven official South African languages: Isixhosa, Sepedi, Sesotho, and Setswana. Canonical correlation analyses and VecMap are the two cross-lingual alignment strategies adopted for this study. Monolingual embeddings used in this work are GloVe (source), and FastText (source and target) embeddings. Our results indicate that with enough comparable corpora, we can develop strong inter-joined representations between English and the considered Southern African languages. More specifically, the best zero-shot transfer results on the available Setswana NHC dataset were achieved using canonically correlated embeddings with Multi-layered perceptron as the training model (54.5% accuracy). Furthermore, our NER best performance was achieved using canonically correlated cross-lingual embeddings with Conditional Random Fields as the training model (96.4% F1 score). Collectively, this study’s results were competitive with the benchmarks of the explored NHC and NER datasets, on both zero-shot NHC and NER tasks with our advantage being the use of very minimal resources.

Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Nakatumba‐Nabende; Babirye; Nabende; et al. 2024. Applied AI Letters.

Self Cite

Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high‐quality natural language processing resources for low‐resourced African languages. Obtaining high‐quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore‐Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.
