Interspeech 2022
DOI: 10.21437/interspeech.2022-11318

Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0

Abstract: Low resource speech recognition can potentially benefit a lot from exploiting a pretrained model such as wav2vec 2.0. These pretrained models have learned useful representations in an unsupervised or self-supervised task, often leveraging a very large corpus of untranscribed speech. The pretrained models can then be used in various ways. In this work we compare two approaches which exploit wav2vec 2.0: an attention-based end-to-end model (AED), where the wav2vec 2.0 model is used in the model encoder, and a hybrid…
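The two uses of wav2vec 2.0 that the abstract contrasts — as the encoder of an AED model versus as a feature extractor for a hybrid HMM/DNN system — can be sketched schematically. The following is a minimal illustration only, not the paper's implementation: the "pretrained encoder" here is a hypothetical frozen random projection standing in for a real wav2vec 2.0 Transformer, and all dimensions and heads are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_encoder(waveform, frame_len=320, dim=16):
    # Stand-in for a frozen pretrained wav2vec 2.0 encoder: the real model
    # downsamples raw 16 kHz audio roughly 320x into frame-level vectors.
    # Here a fixed random projection just illustrates the data flow.
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    W = rng.standard_normal((frame_len, dim)) * 0.01  # "pretrained" weights
    return frames @ W  # (n_frames, dim) representations shared by both systems

def aed_decoder_step(enc_out, query):
    # AED route: an attention-based decoder attends over the encoder frames
    # at each output step; this returns one context vector it conditions on.
    scores = enc_out @ query                 # (n_frames,) attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return weights @ enc_out                 # (dim,) context vector

def hybrid_frame_scores(enc_out, n_senones=8):
    # Hybrid route: a frame-wise classifier over HMM states; the resulting
    # log-probabilities would feed an HMM decoder with a lexicon and LM.
    W = rng.standard_normal((enc_out.shape[1], n_senones)) * 0.1
    logits = enc_out @ W
    return logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

audio = rng.standard_normal(16000)           # 1 s of fake 16 kHz audio
reps = pretrained_encoder(audio)             # shared pretrained representations
context = aed_decoder_step(reps, rng.standard_normal(reps.shape[1]))
frame_logp = hybrid_frame_scores(reps)
print(reps.shape, context.shape, frame_logp.shape)
```

The structural difference the sketch highlights: the AED decoder consumes the representations through attention one output token at a time, while the hybrid system scores every frame independently and leaves sequence modeling to the HMM.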


Cited by 5 publications (4 citation statements)
References 21 publications
“…Previously, we applied wav2vec 2.0 pretrained Transformers to North Sámi [1]. We found that the hidden Markov model / deep neural network (HMM/DNN) approach outperformed the attention-based encoder-decoder (AED) approach, and thus are continuing our work here focusing on HMM/DNN-systems.…”
Section: Introduction
confidence: 82%
“…Here, we are able to leverage an additional North Sámi text resource. This resource, called Freecorpus (FC), consists of freely available texts, collected by Giellatekno and Divvun 1 .…”
Section: Data
confidence: 99%
“…To the best of our knowledge, this proposed multihead inference is a novel improvement for the HMM/DNN approach, though it resembles an efficient form of model combination. We presented initial results using this approach in [44] and explore it here in more detail. Table I compares the various acoustic model training criteria and output heads used during inference.…”
Section: A Hybrid Hidden Markov Model / Deep Neural Network Systems
confidence: 99%