2006
DOI: 10.1007/11846406_25

Data-Driven Part-of-Speech Tagging of Kiswahili

Abstract: In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we observe the latter as being the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural …
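
To make the idea of a "committee of taggers" concrete, here is a minimal sketch of one simple combination scheme: a per-token majority vote, with ties resolved in favor of a designated tagger. This is an illustration only and is not claimed to be the combination method evaluated in the paper. The function name `committee_vote`, the `tie_breaker` parameter, and the tags in the usage example are hypothetical.

```python
from collections import Counter


def committee_vote(predictions, tie_breaker=0):
    """Combine per-token tags from several taggers by simple majority vote.

    predictions: list of tag sequences, one per tagger, all the same length.
    tie_breaker: index of the tagger whose tag is preferred when votes tie
                 (an assumption of this sketch, not taken from the paper).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        winners = [tag for tag, n in counts.items() if n == best]
        if len(winners) == 1:
            combined.append(winners[0])
        elif token_tags[tie_breaker] in winners:
            # Tie: trust the designated tagger if its tag is among the winners.
            combined.append(token_tags[tie_breaker])
        else:
            # Tie without the designated tagger: take the first tied tag
            # in tagger order so the result is deterministic.
            combined.append(next(t for t in token_tags if t in winners))
    return combined


if __name__ == "__main__":
    # Made-up outputs of three taggers for a three-token sentence.
    outputs = [
        ["N", "V", "ADJ"],
        ["N", "N", "ADJ"],
        ["PRON", "V", "ADV"],
    ]
    print(committee_vote(outputs))
```

In practice the tie-breaking tagger would presumably be whichever individual tagger scores best on held-out data; that choice is left to the caller here.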

Citations: cited by 19 publications (27 citation statements)
References: 9 publications
“…Subsequent annotation efforts will undoubtedly greatly benefit from the Northern Sotho tagger developed in the context of this article. While 93.5% tagging accuracy is an encouraging result, this is still not up to par with data-driven taggers for English (Van Halteren et al 2001) or Swahili (De Pauw et al 2006), achieving near-human type of tagging accuracies. This is undoubtedly due to the limited size of the annotated corpus.…”
Section: Towards Large Pos-tagged Corpora (mentioning)
confidence: 95%
“…Given the availability of annotated data for some language, all of these tools become a viable option to construct a POS-tagger (De Pauw et al 2006). While there are significant differences in the way these respective data-driven methods implement the solution to the problem, they all have in common that they try to 'mimic' the behaviour of the manual annotators, by trying to capture linguistic patterns using statistical and/or symbolic means.…”
Section: Maximum Entropy Tagging (mentioning)
confidence: 99%
“…Swahili, or other closely related languages, are spoken by nearly the entire population of the Comoros and by relatively small numbers of people in Burundi, Rwanda, Malawi, Northern Zambia and Mozambique. The language is still understood in the southern ports of the Red Sea and along the coasts of Southern Arabia and the Persian Gulf in the twentieth century [3]. In the Guthrie non genetic classification of Bantu languages, Swahili is included under Zone G. The earliest known documents written in Swahili are letters written in Kilwa in 1711, in the Arabic alphabet.…”
Section: About Swahili / Kiswahili Language (mentioning)
confidence: 99%
“…There is scarcity of sources in the sense that the digital text resources are few. The recent effort on the same is handled carefully with selected procedure for Swahili [2,3]. For language technology applications such as speech recognition system, text-to-speech synthesis, machine aided translation and web related issues there is a great need for translation and usability of the Swahili language.…”
Section: Introduction (mentioning)
confidence: 99%
“…A data-driven morpho-syntactic tagger was developed for Swahili (De Pauw et al 2006) and Northern Sotho (De Schryver and De Pauw 2007). An unsupervised approach to morphological analysis has been applied to Luo, a Nilotic language, and Gikuyu.…”
Section: Bantu Computational Morphological Analysis (mentioning)
confidence: 99%