This paper describes ongoing work on creating Latvian language resources for the medical domain, focusing on digital imaging, in order to develop a medical speech recognition system for Latvian. The language resources include a pronunciation lexicon, a text corpus for language modelling, and an orthographically transcribed speech corpus for (i) adaptation of the acoustic model, (ii) evaluation of speech recognition accuracy, and (iii) development and testing of rewrite rules for automatic conversion of text to the spoken form and back to the written form. This work is part of a larger industry-driven research project that aims to develop specialised Latvian speech recognition systems for the medical domain.
The Latvian Language Learners Corpus (LaVA), developed at the Institute of Mathematics and Computer Science, University of Latvia, includes more than 1000 texts written by foreign learners of Latvian who are in their first or second semester at Latvian higher education institutions and are reaching the A1 (possibly A2) proficiency level. The corpus contains more than 180 000 words. The morphologically annotated texts have been checked manually, and the learners' errors have been manually annotated. Each text is also accompanied by metadata about its author: gender, age, native language, and knowledge of other languages. When analysing the data, this information can be used to determine how the learner's native language and general language skills affect the acquisition of Latvian. Users can analyse the data both on the LaVA website (see http://lava.korpuss.lv/search) and in the SketchEngine tool, where quantitative and qualitative analysis can be performed. The quantitative approach makes it possible to identify tendencies in the use of a word, word form, or construction and to determine the frequency of learners' mistakes. In addition, the objectivity of the research is ensured by looking at the learner data from different aspects and repeating the analysis. For example, a statistical analysis of the nouns in learners' texts shows that declension 4 nouns are used most often. Next in frequency are declension 1, 5 and 2 nouns, while declension 3 and 6 nouns and indeclinable nouns are used very rarely. Qualitative analysis reveals certain features of morphology and word formation, including aspects of syntax, based on empirical data. It is possible to analyse qualitatively the erroneous use of nouns, verbs, or other parts of speech in order to understand what rules govern it.
Examples include using non-reflexive verbs instead of reflexive verbs, using infinitives instead of finite (person) forms, and using a suffix that does not fit the noun paradigm. Exercises and tests are generated on the basis of the LaVA data analysis, including the learner error analysis. The exercises are intended to help learners strengthen their linguistic competence in Latvian, for example, in the use of indicative verb forms in both indefinite and perfect tenses. Exercise creation consists of three stages: (1) analysis of LaVA errors and identification of typical errors; (2) collection of sample sentences from various corpora of Latvian (for example, LVK2018 and Saeima) containing the word forms and constructions in which learners most often make mistakes in LaVA texts; (3) generation of different exercises using the selected sample sentences.
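The three stages of exercise creation can be illustrated with a minimal Python sketch. It is not the LaVA project's implementation: the annotation format, the error-type labels, the toy sentences, and all function names are assumptions made for illustration, and the only exercise type shown is a simple gap-fill.

```python
import re
from collections import Counter

# Hypothetical error annotations (format and labels are assumptions).
ANNOTATIONS = [
    {"type": "reflexive_verb", "wrong": "mācu", "correct": "mācos"},
    {"type": "reflexive_verb", "wrong": "mācu", "correct": "mācos"},
    {"type": "infinitive_for_finite", "wrong": "dzīvot", "correct": "dzīvoju"},
]

# Toy stand-in for sentences drawn from a reference corpus.
CORPUS = [
    "Es mācos latviešu valodu.",
    "Es dzīvoju Rīgā.",
    "Šodien ir silts.",
]


def typical_errors(annotations, top_n=1):
    """Stage 1: return the top_n most frequent error types."""
    counts = Counter(a["type"] for a in annotations)
    return [error_type for error_type, _ in counts.most_common(top_n)]


def target_forms(annotations, error_types):
    """Collect the corrected forms that belong to the selected error types."""
    return {a["correct"] for a in annotations if a["type"] in error_types}


def select_sentences(corpus, forms):
    """Stage 2: keep sentences that contain a target form as a whole word."""
    selected = []
    for sentence in corpus:
        for form in sorted(forms):
            if re.search(rf"\b{re.escape(form)}\b", sentence):
                selected.append((sentence, form))
                break
    return selected


def make_gap_fill(sentence, form):
    """Stage 3: blank out the first occurrence of the target form."""
    prompt = re.sub(rf"\b{re.escape(form)}\b", "____", sentence, count=1)
    return {"prompt": prompt, "answer": form}


def build_exercises(annotations, corpus, top_n=1):
    """Run the three stages in order and return gap-fill exercises."""
    forms = target_forms(annotations, typical_errors(annotations, top_n))
    return [make_gap_fill(s, f) for s, f in select_sentences(corpus, forms)]


if __name__ == "__main__":
    for exercise in build_exercises(ANNOTATIONS, CORPUS):
        print(exercise["prompt"], "->", exercise["answer"])
```

In this sketch the sample sentences are chosen by matching corrected word forms directly; a real pipeline would more plausibly query the corpora by lemma and morphological tag.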
Basic word order is one of the features by which languages are classified into particular types. Latvian belongs to the languages with free word order, but its basic order is considered to be subject-verb-object (SVO). The paper first analyses the basic word order of Latvian and other word order patterns, and identifies the main characteristics of the patterns that differ from the basic order. The second part of the paper examines whether changed word order in Latvian is also accompanied by any prosodic features, and whether typical intonation features can be observed for particular combinations of subject, predicate and object. In the language material examined, the end of the intonational phrase is most often stressed, regardless of the order of subject, predicate and object. If the intonational phrase ends with the subject or the object, the predicate may also be stressed in addition to the stress at the end of the intonational phrase.

Keywords: basic word order, word order patterns, intonational phrase, stress.

The paper discusses the criteria by which the basic word order of a language is identified, describes the basic word order of Latvian, outlines the typical position of the object, and analyses word order patterns found in various resources created at the Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science, University of Latvia: 1) the Balanced Corpus of Modern Latvian (hereafter LVK); 2) the Latvian Verb Valency Database (hereafter "Valency Database"); 3) the Latvian Speech Corpus (hereafter LVRK).

Criteria for determining basic word order

In word order typology, the dominant view is that a particular basic word order can be attributed to each language: not only the basic order of predicate (V), subject (S) and object (O) [1], but also, e.g., of the noun and

[1] The paper uses the designations for subject (S), predicate (V) and object (O) that are traditional in descriptions of the typological classification of languages (see also Greenberg 1990).