The development of a hyphenator and compound analyser for Afrikaans
This article describes the development of two core technologies for Afrikaans, viz. a hyphenator and a compound analyser. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core technologies in question were first developed using a rule-based approach. In evaluation, the rule-based hyphenator obtained an f-score of 90,84%, while the rule-based compound analyser reached an f-score of only 78,20%. Since these results were judged insufficient for practical implementation, it was decided to use a machine learning technique (memory-based learning) instead. Training data for each of the two core technologies was then developed using “TurboAnnotate”, an interface designed to improve the accuracy and speed of manual annotation. The hyphenator developed using machine learning was trained with 39 943 words and reaches an f-score of 98,11%, while the compound analyser reaches an f-score of 90,57% after being trained with 77 589 annotated words. It is concluded that machine learning (specifically memory-based learning) seems an appropriate approach for developing core technologies for Afrikaans.
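To make the memory-based approach concrete, the following is a minimal sketch of hyphenation as nearest-neighbour classification. It is not the project's implementation: the window size, the overlap distance, the function names and the four toy training words are all assumptions chosen for illustration. Each position between two letters becomes one stored instance (three letters of context on either side), and a new position receives the label of the most similar stored instance.

```python
from collections import Counter

PAD = "_"  # padding symbol for context beyond the word edges (assumption)


def instances(word, window=3):
    """Yield one feature tuple per position between two letters of `word`."""
    padded = PAD * window + word + PAD * window
    for i in range(1, len(word)):
        centre = i + window  # index in `padded` of the letter after the position
        yield tuple(padded[centre - window:centre + window])


def train(annotated, window=3):
    """Store every position of every annotated word, e.g. 'wa-ter', in memory."""
    memory = []
    for hyphenated in annotated:
        word = hyphenated.replace("-", "")
        breaks, pos = set(), 0
        for ch in hyphenated:
            if ch == "-":
                breaks.add(pos)  # a break occurs before letter index `pos`
            else:
                pos += 1
        for i, feats in enumerate(instances(word, window), start=1):
            memory.append((feats, i in breaks))
    return memory


def overlap(a, b):
    """Overlap distance: the number of feature positions that differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(memory, feats, k=1):
    """Majority label among the k stored instances closest to `feats`."""
    ranked = sorted(memory, key=lambda item: overlap(item[0], feats))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def hyphenate(memory, word, window=3):
    """Insert a hyphen at every position classified as a break."""
    if not word:
        return word
    out = [word[0]]
    for i, feats in enumerate(instances(word, window), start=1):
        if classify(memory, feats):
            out.append("-")
        out.append(word[i])
    return "".join(out)


# Toy data for illustration only; not taken from the project's annotated corpus.
memory = train(["wa-ter", "ta-fel", "ka-mer", "boe-ke"])
print(hyphenate(memory, "water"))
```

With so few stored instances the sketch only reproduces what it has seen reliably; the abstract's reported f-scores depend on tens of thousands of annotated words.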
The BA Language Technology program was recently introduced at the North-West University and is, to date, the only one of its kind in South Africa. This paper gives an overview of the program, which consists of computational linguistics subjects as well as subjects from languages, computer science, mathematics, and statistics. A brief discussion of the content of the program, and specifically of the computational linguistics subjects, illustrates that the BA Language Technology program is a vocationally directed, future-oriented teaching program that prepares students for both graduate studies and a career in language technology. By means of an example, it is then illustrated how students and researchers alike benefit from working side by side on research and development projects when a problem-based, project-organized approach to curriculum design and teaching is used.
A part-of-speech tagger for Afrikaans
Suléne Pilon
Centre for Text Technology (CTexT), Research Unit: Language and Literature in the South African Context, North-West University, Potchefstroom Campus
e-mail: sulene.piton@nwu.ac.za
Summary: A part-of-speech tagger is an important core technology that forms an essential component of various human language technology applications, and it is therefore of key importance to develop a part-of-speech tagger for a language that has an emerging HLT industry. The development of a first part-of-speech tagger for Afrikaans is described in this article. The tagger was developed by training the TnT algorithm, a machine learning algorithm based on a Hidden Markov Model, with Afrikaans data. The reason for the choice of algorithm is set out in the article. The part-of-speech tagger is implemented with a tagset that was developed specifically for Afrikaans. It is possible to implement the tagset at different levels of specificity, and the tagger is therefore subjected to two different sets of evaluations. The first evaluates the tagger with the complete set of 139 tags and the second with a simplified tagset consisting of only 13 tags. With the complete tagset, the tagger reaches an accuracy of 85.87% with 20 000 words of training data. When it is tested on the same text, but implemented with a simplified version of the tagset (13 tags), it reaches an accuracy of 93.69% with 20 000 words of training data. The tagger is thus not yet accurate enough to be used in language technology applications, but it can be used to generate further training data semi-automatically, with which a more accurate part-of-speech tagger can be trained.
Abstract: A part-of-speech tagger (POS tagger) is an important core technology necessary for the development of various human language technology applications and it is thus of great importance to develop a POS tagger for a language with an emerging human language technology (HLT) industry. The development of a first POS tagger for Afrikaans is described in this article. The tagger was developed by training the TnT algorithm, a machine learning algorithm based on Hidden Markov Models, with annotated Afrikaans data. The reasons for using this algorithm are explicated in the article. The tagger uses a tagset that was developed specifically for Afrikaans to tag the words in an input text. This tagset can be implemented on different levels of specificity and the tagger is therefore evaluated both with a very specific, fine-grained tagset and with a much more general tagset to determine the effect of the size of a tagset on the accuracy of a POS tagger. With the complete tagset of 139 very specific tags, the tagger is able to tag 85.87% of words correctly after being trained with only 20 000 words. When using a tagset of only 13 general tags, the tagger is 93.69% accurate on the same text after being trained with th...
The development of an inflected form generator for Afrikaans
In this article the development of an inflected form generator for Afrikaans is described. Two requirements are set for this inflected form generator, viz. that it can generate only one specific inflected form of a lemma and that it can generate all possible inflected forms of a lemma. The decision to use machine learning instead of the more traditional rule-based approach in the development of this core technology is explained, and a brief overview of the development of LIA, a lemmatiser for Afrikaans, is given. Experiments are done with three different methods, and it is shown that the most effective way of developing an inflected form generator for Afrikaans is to train a separate classifier for each affix. One classifier is therefore trained to generate the plural form, one to generate the diminutive, one to generate the plural of the diminutive, et cetera. The final inflected form generator for Afrikaans (AIL-3) reaches an average accuracy of 86,37% on the training data and 86,88% on a small amount of new data. It is indicated that, with the help of a preprocessing module, AIL-3 meets the requirements that were set for an Afrikaans inflected form generator. Finally, suggestions are made on how to improve the accuracy of AIL-3.
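The one-classifier-per-affix design can be sketched as follows. This is an illustrative sketch only, not AIL-3: the abstract does not specify the features or class representation, so the right-aligned letter features, the "strip n letters, add suffix" class, the `AffixClassifier` name and the toy training pairs are all assumptions. Each inflection category gets its own small nearest-neighbour classifier that maps a lemma to an edit and applies it.

```python
def edit_class(lemma, form):
    """Describe an inflection as (letters to strip from the lemma, suffix to add)."""
    n = 0
    while n < min(len(lemma), len(form)) and lemma[n] == form[n]:
        n += 1
    return (len(lemma) - n, form[n:])


def features(lemma, width=4):
    """The last `width` letters of the lemma, right-aligned and padded with '_'."""
    return tuple(lemma[::-1][:width].ljust(width, "_"))


def overlap(a, b):
    """Overlap distance: the number of feature positions that differ."""
    return sum(x != y for x, y in zip(a, b))


class AffixClassifier:
    """One nearest-neighbour classifier for a single inflection category."""

    def __init__(self, pairs):
        # Store every (lemma, inflected form) pair as features plus edit class.
        self.memory = [(features(l), edit_class(l, f)) for l, f in pairs]

    def generate(self, lemma):
        feats = features(lemma)
        _, (strip, suffix) = min(
            self.memory, key=lambda item: overlap(item[0], feats)
        )
        return lemma[:len(lemma) - strip] + suffix


# Toy training pairs for illustration only; not the project's data.
generators = {
    "plural": AffixClassifier([
        ("boek", "boeke"), ("kat", "katte"), ("hond", "honde"),
        ("tafel", "tafels"), ("kamer", "kamers"),
    ]),
    "diminutive": AffixClassifier([
        ("boek", "boekie"), ("kat", "katjie"), ("hond", "hondjie"),
        ("tafel", "tafeltjie"), ("kamer", "kamertjie"),
    ]),
}

# Requirement 1: one specific inflected form of a lemma.
print(generators["plural"].generate("kat"))
# Requirement 2: all inflected forms the sketch knows about for a lemma.
print({name: g.generate("tafel") for name, g in generators.items()})
```

Keeping the classifiers separate means each one only has to learn the edits for its own affix, which is one plausible reading of why the per-affix method performed best in the experiments.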