6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018) 2018
DOI: 10.21437/sltu.2018-8

A Small Griko-Italian Speech Translation Corpus

Abstract: This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 20 minutes of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Applying an automatic unit discovery method, pseudo-phones were also…

Cited by 6 publications (6 citation statements)
References 31 publications (46 reference statements)
“…Among severely endangered varieties, Griko is the most represented in NLP. Previous work includes two Griko-Italian parallel corpora: A corpus of narratives with POS annotations and a small speech-derived corpus annotated with morphosyntactic, POS, glosses, and speech-related information (Boito et al, 2018; Lekakou et al, 2013). Other efforts in this space include Molise Slavic, for which field recordings, transcriptions, and Italian and German translations have been made available for the varieties of Acquaviva Collecroce, San Felice, and Montemitro (Breu, 2017).…”
Section: NLP for Specific Varieties of Italy
Citation type: mentioning
confidence: 99%
“…We simulate the low-resource scenario by giving the system access to a subset of the data with the possible addition of noise (e.g., Besacier, Zhou, and Gao 2006; Stahlberg et al 2016). Some have compiled small corpora in the course of their work, for example, Griko (Boito et al 2018) or Mboshi (Godard et al 2018a; Rialland et al 2018). Some collaborations have tapped a long-standing collection activity by one of the partners, for example, Yongning Na (Adams et al 2017).…”
Section: Test Sets and Evaluation Measures
Citation type: mentioning
confidence: 99%
“…There is no large-scale monolingual corpus of Griko, but there are two Griko-Italian parallel corpora (Zanon Boito et al, 2018;, with the smaller one including gold word alignment annotations. However, Griko has never had a consistent orthography, and hence its tokenisation and word segmentation differ across these corpora: the smaller data set is based on orthographic conventions from Italian, while the larger one follows the concept of a phonological word.…”
Section: Griko
Citation type: mentioning
confidence: 99%
“…As with Na, there is no large-scale monolingual data to pretrain with. As cross-lingual resources for Griko, there are two parallel corpora that are both aligned with Italian: one contains 330 sentence pairs with gold word alignment annotations (Zanon Boito et al, 2018), and the second contains about 10k sentence pairs without word alignments. However, since Griko has never had a consistent orthography, tokenisation and word segmentation differ across these corpora: the smaller data set is based on orthographic conventions from Italian, while the larger data set follows the concept of a phonological word.…”
Section: Griko
Citation type: mentioning
confidence: 99%