2019
DOI: 10.48550/arxiv.1907.12895
Preprint

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

Cited by 1 publication
(2 citation statements)
“…As the dataset has been a popular and useful resource, it has been further extended with captions in other languages such as Chinese (Li et al, 2016) and Turkish (Unal et al, 2016). However, as these captions were independently crowdsourced, they are not translations of each other, which makes them less effective for MMT.…”
Table rows interleaved in the excerpt (dataset; audio; segments; languages):
(Federmann and Lewis, 2017); 4.5-10h audio; 7k-18k segments; de, en, fr, ja, zh
IWSLT '18 (Niehues et al, 2018); 1,565 audio clips; 171k segments; de, en
LibriSpeech (Kocabiyikoglu et al, 2018); 236h audio; 131k segments; en, fr
MuST-C (Di Gangi et al, 2019a); 385-504h audio; 211k-280k segments; 10 languages
MaSS (Boito et al, 2019); 18.5-23h audio; 8.2k segments; 8 languages
Section: Flickr8k
Citation type: mentioning
confidence: 99%
“…The Multilingual corpus of Sentence-aligned Spoken utterances (MaSS) (Boito et al, 2019) is a multilingual corpus of read Bible verses and chapter names from the New Testament. It is fully multi-parallel across 8 languages (Basque, English, Finnish, French, Hungarian, Romanian, Russian, and Spanish), comprising 56 language pairs in total.…”
Section: Mass
Citation type: mentioning
confidence: 99%
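
The figure of 56 language pairs quoted above is consistent with counting ordered (source, target) pairs among the 8 listed languages, since 8 × 7 = 56. The excerpt does not say whether pairs are directed, so that reading is an assumption. A minimal Python sketch of the count, where the function name `directed_pairs` is illustrative and not taken from the corpus or its tooling:

```python
from itertools import permutations

# Languages exactly as listed in the citation statement above.
LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages.

    Assumption: a "language pair" is directed, so (English, French)
    and (French, English) are counted separately.
    """
    return list(permutations(languages, 2))


pairs = directed_pairs(LANGUAGES)
print(len(pairs))  # 8 * 7 = 56
```

If pairs were counted without direction, the same 8 languages would give 28 pairs, which is why the directed reading is the one that matches the quoted total.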