IberSPEECH 2021
DOI: 10.21437/iberspeech.2021-25

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

Abstract: This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming 1500ms nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different…
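
As an aside on the linear LM combination mentioned in the abstract, the following is a minimal sketch of how such an interpolation is typically computed during decoding. The function name, the weights, and the per-model probabilities are illustrative assumptions, not values reported in the paper.

# Minimal sketch of linear language-model interpolation (as commonly used when
# combining an n-gram, an LSTM and a Transformer LM in hybrid ASR decoding).
# Weights and probabilities below are illustrative placeholders only.

def interpolate_lm(probs, weights):
    """Linearly combine P(word | history) estimates from several LMs."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1"
    return sum(weights[m] * probs[m] for m in probs)

# Hypothetical next-word probabilities from the three model families.
p_next = interpolate_lm(
    probs={"ngram": 0.012, "lstm": 0.020, "transformer": 0.025},
    weights={"ngram": 0.3, "lstm": 0.3, "transformer": 0.4},
)
print(f"Interpolated P(word | history) = {p_next:.4f}")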

Cited by 3 publications (5 citation statements) | References 16 publications
“…This is in fact a common situation for ASR system builders nowadays: in general, it is expected that (minor) discrepancies between available raw transcriptions and their unavailable verbatim counterparts will be largely compensated by exploiting more training data. Although we agree with this view to a certain extent, we have recently observed that, when available transcriptions are at a significant (WER) distance from the true (verbatim) transcriptions, preprocessing of training data by refined "noise" filtering certainly pays off [16,17]. This being the case with EP data (https://www.mllp.upv.es/europarl-asr), we decided to apply this refined filtering to the training set, and also a novel, more advanced kind of transcription "reconstruction" preprocessing which we refer to as verbatimization.…”
Section: Filtering and Verbatimization (supporting)
confidence: 69%
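
The excerpt above reasons in terms of the WER distance between raw and verbatim transcriptions; as a reference point, below is a minimal sketch of how that distance is usually measured, i.e. word error rate via a Levenshtein word alignment. The reference/hypothesis pair is an invented toy example, not data from the Challenge.

# Minimal sketch of word error rate (WER), the distance the excerpt refers to
# when comparing raw transcriptions against their verbatim counterparts.
# The reference/hypothesis pair below is an invented example.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

verbatim = "well we uh approved the the report"   # verbatim-style reference
raw = "we approved the report"                    # edited (raw) transcription
print(f"WER = {wer(verbatim, raw):.2%}")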
“…To provide baseline figures for offline and streaming ASR in the MEP and Guest tasks, a common experimental setting was used to build a different hybrid ASR system for each of the training transcription sets provided: raw, filt and verb. As in [17], acoustic modelling was done by first training context-dependent feed-forward DNN-HMMs with three left-to-right tied states using the transLectures-UPV toolkit [19]. State tying was based on a phonetic decision tree approach [20] which, in our case, produced 15K, 11K, and 15K tied states for, respectively, the raw, filt and verb data.…”
Section: Baseline Experiments and Results (mentioning)
confidence: 99%
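
The excerpt mentions phonetic decision-tree state tying [20]; the toy sketch below illustrates only the underlying split criterion (choose the context question that maximises the Gaussian log-likelihood gain when splitting a pool of triphone states). The phone classes, questions, and one-dimensional "acoustic stats" are invented for illustration and are not taken from the cited systems.

import math

def gaussian_loglik(samples):
    """Log-likelihood of samples under their own ML single Gaussian (1-D toy)."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Triphone states "l-c+r" sharing a centre phone, each with toy frame values.
states = {
    "b-a+t": [1.0, 1.2, 0.9], "p-a+t": [1.1, 1.3],
    "n-a+t": [3.0, 3.2, 2.9], "m-a+d": [3.1, 2.8],
}
# Candidate yes/no questions about the left-context phone (hypothetical classes).
questions = {
    "L_is_nasal": {"n", "m"},
    "L_is_bilabial": {"b", "p", "m"},
}

pooled = [x for samples in states.values() for x in samples]
base = gaussian_loglik(pooled)

best = None
for name, phone_class in questions.items():
    yes, no = [], []
    for tri, samples in states.items():
        left = tri.split("-")[0]
        (yes if left in phone_class else no).extend(samples)
    if not yes or not no:
        continue
    gain = gaussian_loglik(yes) + gaussian_loglik(no) - base
    if best is None or gain > best[1]:
        best = (name, gain)

print(f"Best splitting question: {best[0]} (log-likelihood gain {best[1]:.2f})")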
“…This paper describes the participation of the Machine Learning and Language Processing (MLLP) research group from the Valencian Research Institute for Artificial Intelligence (VRAIN), hosted at the Universitat Politècnica de València (UPV), in the Albayzín-RTVE 2020 Speech-to-Text (S2T) Challenge, with an extension focused on building equivalent systems under the 2018 closed data conditions. The article is an extended version of the original submission to the Challenge, presented in IberSPEECH 2020 [1].…”
Section: Introduction (mentioning)
confidence: 99%
“…New projects in this field have the potential not only to step up effectiveness, but even to expand the footprint of the technology's application. Programmers have zeroed in on more complex tasks, namely streaming STT for TV news to accommodate the needs of hearing-impaired citizens or provide online subtitles (Jorge, Giménez, Baquero-Arnal et al., 2021; Perero-Codosero, Juan, Fernando et al., 2022; Kuzmin & Ivanov, 2021). Needless to say, the uncontrolled noise environment inherent in such fields presents a challenge (Montegro et al., 2021).…”
(mentioning)
confidence: 99%