Data-Driven Part-of-Speech Tagging of Kiswahili

Pauw, Guy De; Schryver, Gilles‐Maurice de; Wagacha, Peter Waiganjo

doi:10.1007/11846406_25

Cited by 19 publications

(27 citation statements)

References 9 publications

“…Subsequent annotation efforts will undoubtedly greatly benefit from the Northern Sotho tagger developed in the context of this article. While 93.5% tagging accuracy is an encouraging result, this is still not up to par with data-driven taggers for English (Van Halteren et al 2001) or Swahili (De Pauw et al 2006), achieving near-human type of tagging accuracies. This is undoubtedly due to the limited size of the annotated corpus.…”

Section: Towards Large Pos-tagged Corpora (mentioning)

confidence: 95%

“…Given the availability of annotated data for some language, all of these tools become a viable option to construct a POS-tagger (De Pauw et al 2006). While there are significant differences in the way these respective data-driven methods implement the solution to the problem, they all have in common that they try to 'mimic' the behaviour of the manual annotators, by trying to capture linguistic patterns using statistical and/or symbolic means.…”

Section: Maximum Entropy Tagging (mentioning)

confidence: 99%
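The statement above describes data-driven taggers as learning to mimic manual annotators from annotated data. A minimal sketch of that idea follows. It is a most-frequent-tag baseline, not the maximum entropy tagger discussed in the citing article, and the toy corpus, the tag labels `N` and `V`, and the function names are illustrative assumptions rather than data or code from any of the cited papers.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from manually annotated sentences."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen for that word
    all_tags = Counter()            # overall tag frequencies, used for unknown words
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag known words from the lexicon; fall back to the most common tag otherwise."""
    return [(word, lexicon.get(word.lower(), default_tag)) for word in words]


# Hypothetical toy corpus (assumption): two hand-annotated sentences.
TOY_CORPUS = [
    [("mtoto", "N"), ("anasoma", "V"), ("kitabu", "N")],
    [("mwalimu", "N"), ("anasoma", "V")],
]

lexicon, default_tag = train_baseline(TOY_CORPUS)
result = tag(["mwalimu", "anasoma", "gazeti"], lexicon, default_tag)
print(result)
```

The unseen word `gazeti` receives the fallback tag, which shows why the size of the annotated corpus matters for any tagger trained this way.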

Dictionary Writing System (DWS) + Corpus Query Package (CQP): The Case of TshwaneLex

Schryver, Pauw

2014

Lex

Self Cite

Abstract: In this article the integrated corpus query functionality of the dictionary compilation software TshwaneLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as such the encouraging outcomes of this study are far-reaching.

“…Swahili, or other closely related languages, are spoken by nearly the entire population of the Comoros and by relatively small numbers of people in Burundi, Rwanda, Malawi, Northern Zambia and Mozambique. The language is still understood in the southern ports of the Red Sea and along the coasts of Southern Arabia and the Persian Gulf in the twentieth century [3]. In the Guthrie non genetic classification of Bantu languages, Swahili is included under Zone G. The earliest known documents written in Swahili are letters written in Kilwa in 1711, in the Arabic alphabet.…”

Section: About Swahili / Kiswahili Language (mentioning)

confidence: 99%

“…There is scarcity of sources in the sense that the digital text resources are few. The recent effort on the same is handled carefully with selected procedure for Swahili [2,3]. For language technology applications such as speech recognition system, text-to-speech synthesis, machine aided translation and web related issues there is a great need for translation and usability of the Swahili language.…”

Section: Introduction (mentioning)

confidence: 99%

Development of Isolated Numeric Speech Corpus for Swahili Language for Development of Automatic Speech Recognition System

Oirere, Deshmukh, Shrishrimal

2013

IJCA

Abstract: Speech corpus being the basic requirement for the development of Automatic speech recognition (ASR) system, it should be done with much accuracy in order to enhance the performance of the system. This paper describes the proposed procedure to abide while collecting the speech corpus of Swahili language from the native and non native speaker for the development of Automatic Speech Recognition system in Swahili language.

“…A data-driven morpho-syntactic tagger was developed for Swahili (De Pauw et al 2006) and Northern Sotho (De Schryver and De Pauw 2007). An unsupervised approach to morphological analysis has been applied to Luo, a Nilotic language, and Gikuyu.…”

Section: Bantu Computational Morphological Analysis (mentioning)

confidence: 99%

Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes

Pauw, Schryver

2011

Lex

Self Cite
