IberSPEECH 2021
DOI: 10.21437/iberspeech.2021-25

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

Abstract: This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge. The primary system (p-streaming 1500ms nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different…
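
As an aside on the linear LM combination mentioned in the abstract, the following is a minimal sketch of how such an interpolation is typically computed during decoding. The function name, the weights, and the per-model probabilities are illustrative assumptions, not values reported in the paper.

# Minimal sketch of linear language-model interpolation (as commonly used when
# combining an n-gram, an LSTM and a Transformer LM in hybrid ASR decoding).
# Weights and probabilities below are illustrative placeholders only.

def interpolate_lm(probs, weights):
    """Linearly combine P(word | history) estimates from several LMs."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1"
    return sum(weights[m] * probs[m] for m in probs)

# Hypothetical next-word probabilities from the three model families.
p_next = interpolate_lm(
    probs={"ngram": 0.012, "lstm": 0.020, "transformer": 0.025},
    weights={"ngram": 0.3, "lstm": 0.3, "transformer": 0.4},
)
print(f"Interpolated P(word | history) = {p_next:.4f}")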

Cited by 3 publications (5 citation statements) | References 16 publications
“…This is in fact a common situation for ASR system builders nowadays: in general, it is expected that (minor) discrepancies between available raw transcriptions and their unavailable verbatim counterparts will be largely compensated by exploiting more training data. Although we agree with this view to a certain extent, we have recently observed that, when available transcriptions are at a significant (WER) distance from the true (verbatim) transcriptions, preprocessing of training data by refined "noise" filtering certainly pays off [16,17]. This being the case with EP data (https://www.mllp.upv.es/europarl-asr), we decided to apply this refined filtering to the training set, and also a novel, more advanced kind of transcription "reconstruction" preprocessing which we refer to as verbatimization.…”
Section: Filtering and Verbatimization (supporting)
confidence: 69%
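
The excerpt above reasons in terms of the WER distance between raw and verbatim transcriptions; as a reference point, below is a minimal sketch of how that distance is usually measured, i.e. word error rate via a Levenshtein word alignment. The reference/hypothesis pair is an invented toy example, not data from the Challenge.

# Minimal sketch of word error rate (WER), the distance the excerpt refers to
# when comparing raw transcriptions against their verbatim counterparts.
# The reference/hypothesis pair below is an invented example.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

verbatim = "well we uh approved the the report"   # verbatim-style reference
raw = "we approved the report"                    # edited (raw) transcription
print(f"WER = {wer(verbatim, raw):.2%}")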
“…To provide baseline figures for offline and streaming ASR in the MEP and Guest tasks, a common experimental setting was used to build a different hybrid ASR system for each of the training transcription sets provided: raw, filt and verb. As in [17], acoustic modelling was done by first training context-dependent feed-forward DNN-HMMs with three left-to-right tied states using the transLectures-UPV toolkit [19]. State tying was based on a phonetic decision tree approach [20] which, in our case, produced 15K, 11K, and 15K tied states for, respectively, the raw, filt and verb data.…”
Section: Baseline Experiments and Results (mentioning)
confidence: 99%
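
The excerpt mentions phonetic decision-tree state tying [20]; the toy sketch below illustrates only the underlying split criterion (choose the context question that maximises the Gaussian log-likelihood gain when splitting a pool of triphone states). The phone classes, questions, and one-dimensional "acoustic stats" are invented for illustration and are not taken from the cited systems.

import math

def gaussian_loglik(samples):
    """Log-likelihood of samples under their own ML single Gaussian (1-D toy)."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Triphone states "l-c+r" sharing a centre phone, each with toy frame values.
states = {
    "b-a+t": [1.0, 1.2, 0.9], "p-a+t": [1.1, 1.3],
    "n-a+t": [3.0, 3.2, 2.9], "m-a+d": [3.1, 2.8],
}
# Candidate yes/no questions about the left-context phone (hypothetical classes).
questions = {
    "L_is_nasal": {"n", "m"},
    "L_is_bilabial": {"b", "p", "m"},
}

pooled = [x for samples in states.values() for x in samples]
base = gaussian_loglik(pooled)

best = None
for name, phone_class in questions.items():
    yes, no = [], []
    for tri, samples in states.items():
        left = tri.split("-")[0]
        (yes if left in phone_class else no).extend(samples)
    if not yes or not no:
        continue
    gain = gaussian_loglik(yes) + gaussian_loglik(no) - base
    if best is None or gain > best[1]:
        best = (name, gain)

print(f"Best splitting question: {best[0]} (log-likelihood gain {best[1]:.2f})")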
“…This paper describes the participation of the Machine Learning and Language Processing (MLLP) research group from the Valencian Research Institute for Artificial Intelligence (VRAIN), hosted at the Universitat Politècnica de València (UPV), in the Albayzín-RTVE 2020 Speech-to-Text (S2T) Challenge, with an extension focused on building equivalent systems under the 2018 closed data conditions. The article is an extended version of the original submission to the Challenge, presented in IberSPEECH 2020 [1].…”
Section: Introduction (mentioning)
confidence: 99%
“…New projects in this field have the potential not only to step up effectiveness, but even to expand the footprint of the technology's application. Programmers have zeroed in on more complex tasks, namely streaming STT for TV news to accommodate the needs of hearing-impaired citizens or provide online subtitles (Jorge, Giménez, Baquero-Arnal et al., 2021; Perero-Codosero, Juan, Fernando et al., 2022; Kuzmin & Ivanov, 2021). Needless to say, the uncontrolled noise environment inherent in such fields presents a challenge (Montegro et al., 2021).…”
(mentioning)
confidence: 99%