Interspeech 2021
DOI: 10.21437/Interspeech.2021-1905

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Abstract: We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament's non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and…

Cited by 4 publications (2 citation statements). References 20 publications.
“…are suitably covered due to their commercial interest, while languages spoken by few people or lacking the support of governments struggle to be even considered by major technological giants. This issue is not new and has been addressed in two different ways: (1) by fostering the production of language (spoken and text) resources, many of them from parliamentary speeches [2][3][4][5][6][7][8]; and (2) by leveraging the resources produced for other languages, e.g., by adjusting (finetuning) models or systems trained on multilingual data [9,10]. In the case of Basque, to compensate for the lack of interest of private companies, efforts have focused on producing data.…”
Section: Introduction (mentioning)
Confidence: 99%
“…with well-defined and widely used latency metrics. From our group we proposed tasks following this idea, but there is still more work to do (Iranzo-Sánchez et al 2020;Díaz-Munío et al 2021). This kind of datasets will frame the research and boost the interest from academia to work on real-world streaming conditions, and not only on academic tasks.…”
Section: Discussion (mentioning)
Confidence: 99%