2020
DOI: 10.48550/arxiv.2010.14233
Preprint
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

Abstract: Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-bas…
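The abstract's mechanism — re-predict a latent frame-level alignment for a fixed number of passes, then collapse it with the CTC rule — can be sketched minimally in Python. Here `refine_step` is a hypothetical stand-in for the paper's Transformer refiner (the real model also conditions on encoder output); only the collapse rule is standard CTC.

```python
def ctc_collapse(alignment, blank=0):
    """Standard CTC collapse: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for label in alignment:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

def align_refine(init_alignment, refine_step, num_iters=3, blank=0):
    """Iterative realignment: each pass re-predicts the whole alignment in
    parallel (its length never changes), so insertions and deletions in the
    *output* space become simple label substitutions in alignment space.
    Collapse only once, at the end, to obtain the transcript."""
    alignment = list(init_alignment)
    for _ in range(num_iters):
        alignment = refine_step(alignment)
    return ctc_collapse(alignment, blank)
```

With an identity `refine_step`, `align_refine([0, 1, 1, 0, 2], lambda a: a)` collapses to `[1, 2]`; blanks between repeats keep them distinct, which is why refinements over alignments can change the output length without changing the alignment length.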

Cited by 2 publications (3 citation statements)
References 9 publications
“…However, a disadvantage of AR models is that the inference time increases linearly with the output length. Recently, non-autoregressive (NAR) models have gained more and more attention in sequence-to-sequence tasks, including machine translation [5][6][7], speech recognition (ASR) [1,[8][9][10][11], and speech translation [12]. In contrast to AR modeling, NAR modeling can predict the output tokens concurrently, making its inference dramatically faster.…”
Section: Introduction
confidence: 99%
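The contrast drawn in this statement — decoder calls growing linearly with output length for AR versus a single parallel pass for NAR — can be illustrated with a toy decoding loop. `step_fn` and `parallel_fn` are hypothetical stand-ins for real model forward passes, not any library's API.

```python
def _argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)

def ar_decode(step_fn, max_len, bos=1, eos=2):
    """Autoregressive: one decoder call per emitted token (O(L) calls)."""
    tokens, calls = [bos], 0
    for _ in range(max_len):
        calls += 1
        nxt = _argmax(step_fn(tokens))  # next-token scores given the prefix
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens[1:], calls

def nar_decode(parallel_fn, length):
    """Non-autoregressive: a single call predicts every position at once."""
    return [_argmax(row) for row in parallel_fn(length)], 1
```

For a 3-token output the AR loop makes 3 sequential decoder calls while the NAR decoder makes 1 — the source of the speedup, and (absent cross-token dependencies) of the accuracy gap the statement mentions.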
“…To accelerate the inference, non-autoregressive transformers (NAT) were proposed for the parallel generation of the output sequence. The idea is widely adopted in neural machine translation (NMT) [4][5][6], automatic speech recognition (ASR) [7][8][9][10][11][12][13][14][15][16][17][18], text-to-speech (TTS) [19,20] and speech translation [21].…”
Section: Introduction
confidence: 99%
“…In addition, Fujita et al. used the idea of the insertion transformer from NMT to generate the output sequence in an arbitrary order [12]. Another recent and effective method uses multiple decoders as refiners to perform iterative refinement based on CTC alignments [14]. In theory, iterative NAT offers only a limited inference speedup, since multiple iterations are still needed to obtain a competitive result.…”
Section: Introduction
confidence: 99%