Interspeech 2021
DOI: 10.21437/Interspeech.2021-1905

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Abstract: We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament's non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and…

Cited by 4 publications (2 citation statements). References 20 publications.
“…are suitably covered due to their commercial interest, while languages spoken by few people or lacking the support of governments struggle to be even considered by major technological giants. This issue is not new and has been addressed in two different ways: (1) by fostering the production of language (spoken and text) resources, many of them from parliamentary speeches [2][3][4][5][6][7][8]; and (2) by leveraging the resources produced for other languages, e.g., by adjusting (finetuning) models or systems trained on multilingual data [9,10]. In the case of Basque, to compensate for the lack of interest of private companies, efforts have focused on producing data.…”
Section: Introduction (mentioning)
Confidence: 99%
“…with well-defined and widely used latency metrics. From our group we proposed tasks following this idea, but there is still more work to do (Iranzo-Sánchez et al 2020;Díaz-Munío et al 2021). This kind of datasets will frame the research and boost the interest from academia to work on real-world streaming conditions, and not only on academic tasks.…”
Section: Discussion (mentioning)
Confidence: 99%