Interspeech 2021
DOI: 10.21437/interspeech.2021-1186
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning

Cited by 17 publications (5 citation statements). References: 0 publications.
“…Recently, a self-supervised training method, wav2vec2.0 [9], has achieved promising results on CTC models, and the pre-trained model is shown to accelerate the convergence during the fine-tuning stage. However, even with the pre-trained model obtained by wav2vec2.0, the CTC model needs an external language model (LM) to relax its conditional independence assumption [9,10]. Several works have investigated incorporating BERT into a NAR ASR model to achieve better recognition accuracies [11][12][13].…”
Section: Introductionmentioning
confidence: 99%
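The conditional independence assumption mentioned in the statement above can be made concrete with a minimal sketch of greedy CTC decoding (not the cited paper's implementation): each frame is decoded by an independent argmax, so no previously emitted token influences the next one, which is exactly the gap an external LM is used to close.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: frame-wise argmax, collapse repeats, drop blanks.

    Because each frame's label is chosen independently of the others,
    no token-level context enters the decision -- the conditional
    independence assumption that an external LM (e.g. via shallow
    fusion during beam search) helps to relax.
    """
    best = log_probs.argmax(axis=-1)  # (T,) best label per frame
    out, prev = [], blank
    for t in best:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out
```

For example, frame-wise argmax labels `[1, 1, 0, 2, 2]` with blank id 0 collapse to the token sequence `[1, 2]`.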
“…However, multi-dialect ASR is an attractive solution in scenarios where sufficient dialect-specific data or information is not available. Therefore, Liu and Fung (2006); Rao and Sak (2017); Jain et al (2018); Yang et al (2018); Fukuda et al (2018); Jain et al (2019); Viglino et al (2019); Deng et al (2021) attempt to improve multi-dialect ASR systems. Liu and Fung (2006) use auxiliary accent trees to model Chinese accent variation.…”
Section: Introductionmentioning
confidence: 99%
“…propose a Transformer-based encoder to simultaneously detect the dialect and transcribe an audio sample. More recently, with increased interest in self-supervised learning, Deng et al (2021) explored self-supervised learning techniques to predict the accent from speech and use the predicted information to train an accent-specific self-supervised ASR. They report that such a model significantly outperforms an accent-independent ASR system.…”
Section: Introductionmentioning
confidence: 99%
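The accent-conditioned setup described in the statement above amounts to a routing step: a predicted accent label selects an accent-specific recognizer. The sketch below is an illustrative assumption (function and recognizer names are hypothetical, not the authors' implementation):

```python
def route_by_accent(features, accent_classifier, recognizers, fallback_asr):
    """Route an utterance to an accent-specific recognizer.

    `accent_classifier` predicts an accent label from the input features;
    the label selects one recognizer from the `recognizers` mapping,
    falling back to an accent-independent model when the predicted
    accent has no dedicated recognizer.
    """
    accent = accent_classifier(features)
    asr = recognizers.get(accent, fallback_asr)
    return asr(features)
```

In practice the classifier and recognizers would share a self-supervised encoder; the dict-dispatch here only shows the control flow, not the model architecture.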