2022
DOI: 10.1007/s10579-022-09606-3

Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks

Abstract: The Donate Speech campaign has so far succeeded in gathering approximately 3600 h of ordinary, colloquial Finnish speech into the Lahjoita puhetta (Donate Speech) corpus. The corpus includes over twenty thousand speakers from all the regions of Finland and from all age brackets. The primary goals of the collection were to create a representative, large-scale resource to study spontaneous spoken Finnish and to accelerate the development of language technology and speech-based services. In this paper, we present…

Cited by 11 publications (12 citation statements)
References 30 publications
“…DSPCon (Enarvi, 2018), FinDialogue (Lennes, 2009), and Lahjoita puhetta (Moisio et al, 2022) corpora represent more spontaneous and conversational forms of Finnish speech. DSPCon contains short conversations between students.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…This is a known issue with AED models on the Lahjoita puhetta test set. To alleviate the problem, we applied a simple post-processing filter where repetitions are allowed to produce a maximum of five tokens, as originally explained by Moisio et al (2022).…”
Section: Models (citation type: mentioning)
confidence: 99%
“…To determine whether CL indeed helps the training process of e2e ASR systems, we tested a range of different setups. All of our results are based on the Lahjoita puhetta (Donate speech) corpus [17], which is a collection of colloquial Finnish speech. The dataset consists of over twenty thousand speakers from different regions of Finland and a variety of age groups [17].…”
Section: Data (citation type: mentioning)
confidence: 99%
“…All of our results are based on the Lahjoita puhetta (Donate speech) corpus [17], which is a collection of colloquial Finnish speech. The dataset consists of over twenty thousand speakers from different regions of Finland and a variety of age groups [17]. Recording was performed by the users' own devices, resulting in a range of different noise levels in the data.…”
Section: Data (citation type: mentioning)
confidence: 99%