In this paper, we describe a data processing pipeline for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. The pipeline converts the data into a lossless standard format (ISO/TEI) for long-term preservation and, at the same time, enables powerful search over this version of the data. For each corpus, the input is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first processing step converts the data, by means of an XSL transformation, into a subset of TEI that follows the ISO standard 'Transcription of spoken language'. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step converts the ISO/TEI files to a JSON format used by the "Tsakorpus" search platform. This step allows us to make the corpora available through a web-based search interface. In addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
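The second conversion step described in the abstract above (ISO/TEI to a search-oriented JSON) can be illustrated with a minimal sketch. It is not the INEL converter: the sample fragment is invented and heavily simplified, the function name tei_to_sentences is hypothetical, and the output keys (speaker, text, words, wf, wtype) are illustrative assumptions rather than the documented Tsakorpus schema. Only the Python standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used by ISO/TEI transcription files.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified ISO/TEI fragment (not taken from any INEL corpus).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <annotationBlock who="#SPK0" start="#T0" end="#T1">
        <u>
          <seg type="utterance">
            <w>first</w>
            <w>example</w>
            <pc>.</pc>
          </seg>
        </u>
      </annotationBlock>
    </body>
  </text>
</TEI>"""


def tei_to_sentences(tei_xml):
    """Turn each annotationBlock into one sentence dict (illustrative layout)."""
    root = ET.fromstring(tei_xml)
    sentences = []
    for block in root.iterfind(".//tei:annotationBlock", NS):
        tokens = []
        for seg in block.iterfind(".//tei:seg", NS):
            for child in seg:
                local_name = child.tag.split("}")[-1]  # drop the namespace part
                if local_name == "w":
                    tokens.append({"wf": (child.text or "").strip(), "wtype": "word"})
                elif local_name == "pc":
                    tokens.append({"wf": (child.text or "").strip(), "wtype": "punct"})
        # Rebuild the surface text: words are space-separated,
        # punctuation attaches to the preceding token.
        text = ""
        for token in tokens:
            if token["wtype"] == "word" and text:
                text += " "
            text += token["wf"]
        sentences.append({
            "speaker": block.get("who", "").lstrip("#"),
            "text": text,
            "words": tokens,
        })
    return sentences


if __name__ == "__main__":
    print(json.dumps({"sentences": tei_to_sentences(SAMPLE_TEI)}, indent=2))
```

A real converter would also need to carry over time alignment, morpheme segmentation and the other annotation tiers mentioned in the abstract, all of which this sketch leaves out.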
This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discusses various domain- and discipline-specific problems and individual solutions. The creation of spoken language corpora is a time-consuming and costly process, and the resulting resources often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high-quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support the existing tasks and goals of research projects.