Anne Ferger scite author profile

Anne Ferger

2Publications

3Citation Statements Received

2Citation Statements Given

How they've been cited

How they cite others

Affiliations

Paderborn University

Publications

Order By: Most citations

Uralic multimedia corpora: ISO/TEI corpus data in the project INEL

Arkhangelskiy¹,

Ferger²,

Hedeland³

2019

View full text Add to dashboard Cite

In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment, morpheme segmentation and other kinds of annotation. The first step of processing is the conversion of the data into a certain subset of TEI following the ISO standard 'Transcription of spoken language' with the help of an XSL transformation. The primary purpose of this step is to obtain a representation of our data in a standard format, which will ensure its long-term accessibility. The second step is the conversion of the ISO/TEI files to a JSON format used by the "Tsakorpus" search platform. This step allows us to make the corpora available through a web-based search interface. As an addition, the existence of such a converter allows other spoken corpora with ISO/TEI annotation to be made accessible online in the future. Tiivistelmä Tässä paperissa kuvataan aineistonnprosessointimenetelmä joka on käytössä uralilaisten puhuttujen korpusten luonnissa kieltedokumentointiprojekti INELissä. Prosessointimenetelmää käytetään konvertoimaan dataa häviöttömään ISO/ TEI-standardiformaattiin pitkän aikavälin säilytystä varten sekä samanaikaisesti tehokkaisiin hakutoimintoihin tälle akineistoversiolle. Jokaisen korpuksen lähtöaineistona on joukko tiedostoja EXMARaLDAn XML-formaatissa, joka sisältää transkriptejä,multimediaa kohdennuksineen, morfeemijäsennyksiä ja muita annotaatiota. Ensimmäinen käsittelyaskel on aineiston konvertointi TEI:n osajouk-This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/ koon, joka muodostaa ISO-standardin puhutun kielen transkripteille, XSL-transformaatioita käyttäen. Tämän askelen ensisijainen tarkoitus on saada aineisto sellaiseen standardimuotoon joka kelpaa pitkäaikaissäilytykseen. Seuraava oaskel on ISO/TEI-tiedostojen konversio JSON-formaattiin, jota "Tsakorpus"-hakualusta käyttää. Tämän avulla saadaan korpus käytettäväksi internethakuliittymälle. Lisäksi, konversio mahdollistaa muiden ISO/TEI-yhteensopivien korpusten annotaatioiden tuomisen saataville tulevaisuudessa.

show abstract

Towards Continuous Quality Control for Spoken Language Corpora

Ferger¹,

Hedeland²

2020

IJDC

View full text Add to dashboard Cite

This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discuss various domain and discipline specific problems and individual solutions. The creation of spoken language corpora is not only a time-consuming and costly process, but the created resources often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support existing tasks and goals of research projects.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Anne Ferger

Uralic multimedia corpora: ISO/TEI corpus data in the project INEL

Towards Continuous Quality Control for Spoken Language Corpora

Contact Info

Product

Resources

About