This paper presents an ongoing project aimed at creation of corpora for minority Uralic languages that contain texts posted on social media. Corpora for Udmurt and Erzya are fully functional; Moksha and Komi-Zyrian are expected to become available in late 2018; Komi-Permyak and Meadow and Hill Mari will be ready in 2019. The paper has a twofold focus. First, I describe the pipeline used to develop the corpora. Second, I explore the linguistic properties of the corpora and how they could be used in certain types of linguistic research. Apart from being generally "noisier" than edited texts in any language (e.g. in terms of higher number of out-of-vocabulary items), social media texts in these languages present additional challenges compared to similar corpora of major languages. One of them is language identification, which is impeded by frequent code switching and borrowing instances. Another is identification of sources, which cannot be performed by entirely automatic crawling. Both problems require some degree of manual intervention. Nevertheless, the resulting corpora are worth the effort. First, the language of the texts is close to the spoken register. This contrasts to most newspapers and fiction, which tend to use partially artificial standardized varieties. Second, a lot of dialectal variation is observed in these corpora, which makes them suitable for dialectological research. Finally, the social media corpora are comparable in size to the collections of other texts available in the digital form for these languages. This makes them a valuable addition to the existing resources for these languages.
In Udmurt, a Uralic language that has experienced long and extensive contact with the dominant Russian language, all four typologically relevant strategies of verbal borrowing are attested: direct and indirect insertion, light verbs, and paradigm insertion. This is unusual both cross-linguistically and for the Uralic family. The paper investigates these strategies and the factors that govern the choice between them. It turns out that, although free variation plays a major role in the distribution of strategies, there are also several important morphological, stylistic and areal factors. By analyzing these factors and the available historical data, I propose a diachronic explanation of the currently observed distribution. The study is mostly based on corpus data collected from contemporary Udmurt-language social media.
In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard 'Transcription of spoken language' with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the "Tsakorpus" search platform. This step allows us to make the corpora available through a web-based search interface. As an addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future. Tiivistelmä Tässä paperissa kuvataan aineistonnprosessointimenetelmä joka on käytössä uralilaisten puhuttujen korpusten luonnissa kieltedokumentointiprojekti INELissä. Prosessointimenetelmää käytetään konvertoimaan dataa häviöttömään ISO/ TEI-standardiformaattiin pitkän aikavälin säilytystä varten sekä samanaikaisesti tehokkaisiin hakutoimintoihin tälle akineistoversiolle. Jokaisen korpuksen lähtöaineistona on joukko tiedostoja EXMARaLDAn XML-formaatissa, joka sisältää transkriptejä,multimediaa kohdennuksineen, morfeemijäsennyksiä ja muita annotaatiota. Ensimmäinen käsittelyaskel on aineiston konvertointi TEI:n osajouk-This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/ koon, joka muodostaa ISO-standardin puhutun kielen transkripteille, XSL-transformaatioita käyttäen. Tämän askelen ensisijainen tarkoitus on saada aineisto sellaiseen standardimuotoon joka kelpaa pitkäaikaissäilytykseen. Seuraava oaskel on ISO/TEI-tiedostojen konversio JSON-formaattiin, jota "Tsakorpus"-hakualusta käyttää. Tämän avulla saadaan korpus käytettäväksi internethakuliittymälle. Lisäksi, konversio mahdollistaa muiden ISO/TEI-yhteensopivien korpusten annotaatioiden tuomisen saataville tulevaisuudessa.
This paper presents the QUEST project and describes concepts and tools that are being developed within its framework. The goal of the project is to establish quality criteria and curation criteria for annotated audiovisual language data. Building on existing resources developed by the participating institutions earlier, QUEST also develops tools that could be used to facilitate and verify adherence to these criteria. An important focus of the project is making these tools accessible for researchers without substantial technical background and helping them produce high-quality data. The main tools we intend to provide are a questionnaire and automatic quality assurance for depositors of language resources, both developed as web applications. They are accompanied by a knowledge base, which will contain recommendations and descriptions of best practices established in the course of the project. Conceptually, we consider three main data maturity levels in order to decide on a suitable level of strictness of the quality assurance. This division has been introduced to avoid that a set of ideal quality criteria prevent researchers from depositing or even assessing their (legacy) data. The tools described in the paper are work in progress and are expected to be released by the end of the QUEST project in 2022.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.