Interspeech 2018
DOI: 10.21437/interspeech.2018-1424
Encoder Transfer for Attention-based Acoustic-to-word Speech Recognition

Abstract: Acoustic-to-word speech recognition based on attention-based encoder-decoder models achieves better accuracy with much lower latency than conventional speech recognition systems. However, acoustic-to-word models require a very large amount of training data, which is difficult to prepare for a new domain such as elderly speech. To address this problem, we propose domain adaptation based on transfer learning with layer freezing. Layer freezing first pre-trains a network with the source domain data, and …
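The layer-freezing recipe described in the abstract can be sketched in a few lines of PyTorch. Everything below is illustrative only: the module names (encoder, decoder), layer sizes, and the choice to freeze the encoder while fine-tuning the rest are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of transfer learning with layer freezing (PyTorch).
# Architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqASR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=10000):
        super().__init__()
        # Encoder: maps acoustic feature frames to hidden representations.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4, batch_first=True)
        # Output projection standing in for the attention decoder.
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, feats):
        enc_out, _ = self.encoder(feats)
        return self.decoder(enc_out)

model = Seq2SeqASR()
# Step 1: pre-train on the large source-domain corpus (training loop omitted).
# Step 2: freeze the pre-trained encoder before adapting to the target domain.
for param in model.encoder.parameters():
    param.requires_grad = False
# Step 3: fine-tune only the unfrozen parameters on the target-domain data.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Freezing shrinks the set of trainable parameters, which is what makes adaptation feasible on small target-domain sets such as elderly speech.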

Cited by 7 publications (6 citation statements)
References 20 publications
“…Most adaptation technologies discussed in this paper can also be applied to domain adaptation [154], [232]-[235]. When the amount of adaptation data is limited, a common practice is adapting only part of the network's layers [236]. To let the adapted model still perform well on the source domain, Moriya et al. [237] proposed progressive neural networks, which add an additional model column to the original model for each new domain and update only the new model column with the new-domain data.…”
Section: Domain Adaptation
confidence: 99%
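As a rough illustration of the progressive-network idea in the statement above, the sketch below freezes a source-domain column and trains only a new column that receives a lateral connection from the frozen one. The two-layer columns and the single lateral adapter are illustrative assumptions, not the configuration used in [237].

```python
# Hedged sketch of a progressive neural network: one frozen source column,
# one trainable new-domain column with a lateral connection between them.
import torch
import torch.nn as nn

class Column(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, lateral_dim=None):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, hidden)
        # Lateral adapter from the previous (frozen) column, if any.
        self.lateral = nn.Linear(lateral_dim, hidden) if lateral_dim else None
        self.layer2 = nn.Linear(hidden, out_dim)

    def forward(self, x, prev_hidden=None):
        h = torch.relu(self.layer1(x))
        if self.lateral is not None and prev_hidden is not None:
            # Mix in the frozen column's activations.
            h = h + torch.relu(self.lateral(prev_hidden))
        return self.layer2(h), h

col_src = Column(in_dim=80, hidden=256, out_dim=500)
for p in col_src.parameters():      # source column is never updated again
    p.requires_grad = False

col_new = Column(in_dim=80, hidden=256, out_dim=500, lateral_dim=256)

x = torch.randn(8, 80)              # a batch of feature vectors
_, h_src = col_src(x)               # frozen source-column activations
logits, _ = col_new(x, prev_hidden=h_src)
```

Because the source column is frozen, source-domain performance cannot degrade; only the new column's parameters are updated for each new domain.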
“…In this paper, we use the attention-based encoder-decoder model [23][24][25] for an end-to-end ASR system. Our implementation of the model is based on [4,8] and is summarized in this section.…”
Section: Attention-Based Encoder-Decoder Model for Automatic Speech Recognition
confidence: 99%
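For readers unfamiliar with the mechanism, the sketch below shows the core attention computation in an encoder-decoder model: alignment scores between the decoder state and each encoder frame, followed by a softmax-weighted sum that yields the context vector. Plain dot-product scoring is used for brevity; the systems in [4,8] typically use more elaborate (e.g. location-aware) attention.

```python
# Minimal dot-product attention for an encoder-decoder ASR sketch.
import torch

def attend(dec_state, enc_states):
    """dec_state: (batch, hidden); enc_states: (batch, time, hidden)."""
    # Alignment score between the decoder state and every encoder frame.
    scores = torch.bmm(enc_states, dec_state.unsqueeze(-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)            # (batch, time)
    # Context vector: attention-weighted sum of encoder states.
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
    return context, weights

enc = torch.randn(4, 100, 256)   # 100 encoded frames per utterance
dec = torch.randn(4, 256)        # current decoder state
context, weights = attend(dec, enc)
print(context.shape, weights.shape)  # torch.Size([4, 256]) torch.Size([4, 100])
```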
“…Recent automatic speech recognition (ASR) systems can map acoustic features directly to word sequences; this approach, called acoustic-to-word (A2W) end-to-end ASR, is based on a fully neural network (FNN)-based architecture [1][2][3][4][5][6][7][8]. Unfortunately, end-to-end ASR systems are not robust to out-of-vocabulary (OOV) words because the number of NN outputs, which correspond to word entries, is fixed.…”
Section: Introduction
confidence: 99%
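The fixed-vocabulary limitation mentioned above can be seen directly: an A2W model has one output unit per training-vocabulary word, so any word outside that list can only surface as an unknown token. The tiny vocabulary below is a made-up example.

```python
# Made-up example of a fixed word vocabulary and the resulting OOV collapse.
vocab = {"<unk>": 0, "hello": 1, "world": 2, "speech": 3}

def to_ids(words):
    # Words absent from the fixed vocabulary fall back to the <unk> id.
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(to_ids(["hello", "speech"]))        # [1, 3]
print(to_ids(["hello", "interspeech"]))   # [1, 0] -> "interspeech" is OOV
```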
“…However, the pruning algorithm is applied to the entire network, and there has been no investigation into subnetwork-wise parameter freezing. Some studies on domain adaptation of ASR models have shown that updating only part of the layers improves performance on the target domain [16,17]. However, there has been no research on its robustness against catastrophic forgetting.…”
Section: Introduction
confidence: 99%
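Catastrophic forgetting is typically quantified by re-scoring the source domain after adaptation. The sketch below computes word error rate (WER) with plain edit distance and compares fabricated pre- and post-adaptation hypotheses; a large WER increase on source-domain data would signal forgetting.

```python
# Hedged sketch: measure forgetting as the source-domain WER gap
# between the pre-adaptation and post-adaptation models.
def wer(ref, hyp):
    """Word error rate via Levenshtein distance over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

ref = "the cat sat on the mat".split()
hyp_before = "the cat sat on the mat".split()   # fabricated example output
hyp_after = "the hat sat on mat".split()        # fabricated example output
print(wer(ref, hyp_before))   # 0.0
print(wer(ref, hyp_after))    # ~0.33 -> source-domain degradation
```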