Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages 2019
DOI: 10.18653/v1/w19-0310

Uralic multimedia corpora: ISO/TEI corpus data in the project INEL

Abstract: In this paper, we describe a data processing pipeline used for annotated spoken corpora of Uralic languages created in the INEL (Indigenous Northern Eurasian Languages) project. With this processing pipeline we convert the data into a lossless standard format (ISO/TEI) for long-term preservation while simultaneously enabling a powerful search in this version of the data. For each corpus, the input we are working with is a set of files in EXMARaLDA XML format, which contain transcriptions, multimedia alignment,…

Cited by 5 publications (3 citation statements)
References 3 publications

“…Visualisation uses various XSL transformations to generate, directly from the ISO/TEI XML file, configurable displays of the transcript (in HTML), a density viewer (in SVG) and configurable video subtitling (in VTT) all of which are synchronised with each other and with the underlying audio or video (see Figure 10). Another corpus analysis platform that now supports the ISO/TEI format is Tsakorpus (Arkhangelskiy et al, 2019), which is one use case for ISO/TEI within the long-term project INEL in Hamburg (Ferger and Jettka, 2020). A project in the related field of language documentation, the international (French/German) DoReCo project (Paschen et al, 2020), developed the Multitool 10 that can generate ISO/TEI as a distribution format for resources in various languages and tool formats.…”
Section: Data Publication and Analysis (Dissemination)
Citation type: mentioning
confidence: 99%
“…By implementing conversion services and import filters (cf. [28][29][30]), it becomes possible to extend the repository solution to changing requirements and usage scenarios.…”
Section: Interoperable Data Through Standards and Open Formats
Citation type: mentioning
confidence: 99%
“…In the FDR, resources remain findable through corpus level DOIs, but the fine grained citability and the corresponding possibility of building new virtual collections will be lost, with resources only provided as a set of files. The DataCite metadata schema 30 does not provide specific elements for language resources as the CLARIN CMDI metadata profiles of the HZSK do, and the resources will not be discoverable via the advanced faceted search of the CLARIN VLO. However, as many users are still unaware of specialised services such as the VLO or virtual collections, there is still time to improve support for all kinds of collection/compound resources and the format options for metadata to be provided via OAI-PMH.…”
Section: Distributing Highly Specific Data Via Generic Repositories-f
Citation type: mentioning
confidence: 99%