Worldwide, the semi-automatic extraction of terms from corpora is becoming the norm for the compilation of terminology lists, term banks or dictionaries for special purposes. If African-language terminologists are willing to take their rightful place in the new millennium, they must not only take cognisance of this trend but also be ready to implement the new technology. This article argues that the best way to do both at this stage is to opt for computationally straightforward alternatives (i.e. use 'raw corpora') and to make use of widely available software tools (e.g. WordSmith Tools). The main aim is therefore to discover whether or not the semi-automatic extraction of terminology from untagged and unmarked running text by means of basic corpus query software is feasible for the African languages. To answer this question, a full-blown case study revolving around Northern Sotho linguistic texts is discussed in great detail. The computational results are compared throughout with the outcome of a manual excerption, and vice versa. Attention is given to the concepts 'recall' and 'precision'; different approaches are suggested for the treatment of single-word terms versus multi-word terms; and the various findings are summarised in a Linguistics Terminology lexicon presented as an Appendix.
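The comparison between computational results and manual excerption mentioned above is usually expressed as precision and recall. The following is a minimal sketch of that calculation, not the procedure used in the article: the function name `precision_recall` and the placeholder term labels are hypothetical, and the manually excerpted list is assumed to serve as the reference.

```python
def precision_recall(extracted, manual):
    """Compare automatically extracted candidates with a manual term list.

    precision = share of extracted candidates that are also in the manual list
    recall    = share of manual terms that the extraction managed to find
    """
    extracted, manual = set(extracted), set(manual)
    shared = extracted & manual  # candidates confirmed by the manual excerption
    precision = len(shared) / len(extracted) if extracted else 0.0
    recall = len(shared) / len(manual) if manual else 0.0
    return precision, recall


# Hypothetical placeholder labels, not terms taken from the case study.
extracted_candidates = ["term_a", "term_b", "term_c", "term_d"]
manual_terms = ["term_a", "term_b", "term_e"]

precision, recall = precision_recall(extracted_candidates, manual_terms)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Treating both lists as sets ignores how often a term occurs, so the sketch measures coverage of term types rather than of individual occurrences in the running text.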
Working with corpora in the South African Bantu languages has until now been limited to the utilisation of raw corpora. Such corpora, however, have limited functionality. Thus the next logical step in any NLP application is the development of software for automatic tagging of electronic texts. The development of a tagset is one of the first steps in corpus annotation. The authors of this article argue that the design of a tagset cannot be isolated from the purpose of the tagset, or from the place of the tagset and its design within the bigger picture of the architecture of corpus annotation. Usage-related aspects therefore feature prominently in the design of the tagset for Northern Sotho. It is explained why the proposed tagset is biased towards human readability rather than machine readability; the choice of a stochastic tagger is motivated; and the relationship between tokenising, tagging, morphological analysis and parsing is discussed. In order to account at least to some extent for the morphological complexity of Northern Sotho at the tagging level, a multilevel annotation is opted for: the first level comprises obligatory information and the second optional and recommended information. Finally, aspects of standardisation are considered against the background of reuse, of sharing of resources, and of possible adaptation for use by other disjunctively written South African Bantu languages. It is not the aim of this article to evaluate the results of any tagging procedure using the proposed tagset; it only describes the design and motivates the choices made with regard to the tagset design. However, an evaluation is in progress and results will be published in the near future (cf. Faaß et al., s.a.).
Abstract: In this article it is shown how a corpus-based dictionary grammar may be compiled, that is, a mini-grammar fully based on corpus data and specifically written for use in and integrated with a dictionary. Such an effort is, to the best of our knowledge, a world first. We exemplify our approach for a Northern Sotho mini-grammar, to be included in a Northern Sotho-English dictionary.
Keywords: LEXICOGRAPHY, DICTIONARY, CORPUS, FREQUENCY, MIDDLE MATTER, DICTIONARY GRAMMAR, NORTHERN SOTHO (SESOTHO SA LEBOA)
Summary: Compiling a corpus-based dictionary grammar: an example for Northern Sotho. This article shows how a corpus-based dictionary grammar can be compiled, that is, a mini-grammar that draws all its data directly from a corpus and that was written specifically for use in a dictionary, with which it is also fully integrated. Such an attempt is, as far as we know, a world first. We illustrate our approach for a mini-grammar of Northern Sotho, intended for use in a Northern Sotho-English dictionary.
Keywords: LEXICOGRAPHY, DICTIONARY, CORPUS, FREQUENCY, MIDDLE MATTER, DICTIONARY GRAMMAR, NORTHERN SOTHO
Using corpora beyond a dictionary's central section(s)
It is now widely accepted that the use of electronic corpora has become indispensable in modern dictionary making, and this on a variety of levels. But on just how many levels? The macrostructural and microstructural levels immediately spring to mind, and most attention in the scientific literature has indeed gone to the corpus-based selection of lemma signs on the one hand, and the corpus-based construction of the articles attached to those lemma signs on the other. Any self-respecting dictionary, however, contains much more than 'just' the central text. Good dictionaries also comprise extra matter, invariably distributed across front, middle and back matter sections.
If one is serious about corpus-based lexicography, then the extra matter should also be rooted in corpus data. One can go a long way by making sure there is a one-to-one correlation between the central (corpus-based) section(s) and the extra matter (cf. below), but during practical dictionary making this quickly proves not to be sufficient. In this article the focus will be on the creation of a corpus-based dictionary grammar, exemplified for Northern Sotho. The core principles of corpus-based lexicography will be briefly reviewed in order to set the stage, but that review is merely incidental and the reader is referred to Sinclair (1987) and Corréard (2002) for what remain to this day the best collections on the topic.
Corpus-based lexicography in a nutshell
In corpus-based lexicography, the main arbiter during the creation of the (initial) macrostructure is the list of frequencies attached to the lemmatised list of inclusion candidates. Clearly, there are as many lemmatisation policies as there are dictionary teams compiling dictionaries, but it remains comm...
Abstract: One of the many implications of the process of language democratization which started post-1994 in South Africa is the empowerment of the previously marginalized South African Bantu languages to become languages of higher functions, i.e. languages of learning and teaching, and also of scientific discourse. This in turn implies the development, consolidation and especially standardization of terminology for each of these languages, and the compilation of LSP dictionaries. This article describes the terminological processing of a technical source text prior to translation, which formed part of the compilation of a Quadrilingual Explanatory Dictionary of Chemistry. It reports on the model of terminology management that was utilized and explores strategies for the internal standardization of terms in the absence of readily available, standardized chemistry terminology.
Keywords: TERMINOLOGY MANAGEMENT, TERMINOLOGY STANDARDIZATION, NORTHERN SOTHO CHEMISTRY TERMINOLOGY, USERS' PREFERENCES, TERM EXTRACTION, TERM EQUIVALENCE, TECHNICAL TRANSLATION
Summary: Management and internal standardization of chemistry terminology: a Northern Sotho case study. One of the many implications of the process of language democratization that took place in South Africa after 1994 is the empowerment of the previously disadvantaged South African Bantu languages to become languages of higher functions as well, that is, languages of teaching and learning, and also languages of scientific discourse. This implies the development, consolidation and especially standardization of terminology for each of these languages, as well as the compilation of subject dictionaries. This article describes the terminological processing of a technical text prior to its translation. The translation forms part of the compilation of a Quadrilingual Explanatory Dictionary of Chemistry. The article reports on the model of terminology management that was used
Studies on corpus-based language teaching are notably absent within the South African educational context, all the more so with regard to the teaching of African languages. This article explores the possibilities offered by the availability of an electronic corpus to enhance language teaching, and more specifically, the teaching of Northern Sotho as a second additional language at first-year university level to first-time learners of the language. Particular attention is paid to corpus-based selection and sequencing of learning material, an activity that has hitherto depended on anecdotal evidence and the intuition of the language teacher. A critical evaluation of existing pedagogical material for Northern Sotho reveals that, although excellent sources of reference, these works are inadequate for the purpose of teaching Northern Sotho to first-time learners. It is indicated that information gleaned from a corpus provides the language teacher with guidance at both the micro and macro levels with regard to the selection and sequencing of learning content.
Background: In recent reviews of autism spectrum disorder screening tools, the Modified Checklist for Autism in Toddlers, Revised with Follow-Up (M-CHAT-R/F™) has been recommended for use in lower middle-income countries to promote earlier identification. Aim: The study aim was to culturally adapt and translate the M-CHAT-R/F™ into Northern Sotho, a South African language. Setting: An expert panel was purposively selected for the review and focus group discussion that was conducted within an academic context. Method: The source translation (English) was reviewed by bilingual Northern Sotho-English speech-language therapists who made recommendations for cultural adaptation. A double translation method was used, followed by a multidisciplinary expert panel discussion and a self-completed questionnaire. Results: Holistic review of test, additional remarks, and grammar and phrasing were identified as the most prominent themes of the panel discussion, emphasising the equivalence of the target translation. Conclusion: A South African culturally adapted English version of the M-CHAT-R/F™ is now available along with the preliminary Northern Sotho version of the M-CHAT-R/F™. The two versions can now be confirmed by gathering empirical evidence of reliability and validity.