2021
DOI: 10.1162/tacl_a_00416

MasakhaNER: Named Entity Recognition for African Languages

Abstract: We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art metho…

Citations: cited by 29 publications (42 citation statements)
References: 33 publications (36 reference statements)
“…In particular, most NER efforts have focused on a few European and Asian languages, while African languages have been given little attention. Only seven studies of NER on Amharic have been found in the literature [49] [3] [9] [17] [69] [68] [1]. In these Amharic NER studies, two NER datasets compiled from different sub-sets of the Walta Information Center Corpus [16] are used.…”
Section: Related Work | mentioning
confidence: 99%
“…In these Amharic NER studies, two NER datasets compiled from different sub-sets of the Walta Information Center Corpus [16] are used. In addition to the Walta Information Center corpus, there is also the Adelani [1] dataset and Sikdar and Gambäck [68] New Mexico State University Computing Research Laboratory dataset, which is annotated for the SAY project. The data is annotated with 6 classes (PER, LOC, ORG, TIME, TTL, and O-other) and it is available on GitHub 1 .…”
Section: Related Work | mentioning
confidence: 99%
“…The hyper-parameters for all models are reported in Table 3. Following 47 , ADAMW was used as the optimizer function 72 . Additionally, we observed k-fold cross-validation aided in better performance during training.…”
Section: NER Model | mentioning
confidence: 99%
“…Our NER models will be evaluated using Precision, Recall, and F1 scores on the train, as well as the test data 47 .…”
Section: Model Evaluation | mentioning
confidence: 99%