2019
DOI: 10.48550/arxiv.1904.13377
Preprint

Very Deep Self-Attention Networks for End-to-End Speech Recognition

Cited by 22 publications (29 citation statements)
References 0 publications
“…The requirement is that the model represents SRC audio and SRC text in a similar way so that it can leverage the ASR and MT tasks learnt during training to perform the ST task. We build upon the deep Transformer [10] for speech proposed in [11]. To enable encoding audio and text jointly, we share the model parameters between the two modalities.…”
Section: Multi-task Model
confidence: 99%
“…Our models use the Transformer architecture with attention-based encoder and decoder [10,11]. We adapt the hyperparameter choices in [14] to our multi-modal setting: 32 audio encoder layers, 12 text encoder layers, 12 decoder layers; the 12 text encoder layers are shared with the top 12 audio encoder layers.…”
Section: Model Configurations
confidence: 99%
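The layer-sharing scheme quoted above (32 audio encoder layers whose top 12 double as the 12-layer text encoder) can be pictured with a minimal sketch. This is an assumption-laden illustration, not the cited authors' code: the class name SharedModalityEncoder, the model dimensions, and the use of stock PyTorch Transformer layers are all hypothetical.

```python
# Minimal sketch (not the cited authors' code) of sharing the 12 text encoder
# layers with the top 12 of 32 audio encoder layers, so audio and text
# representations pass through the same final parameters.
import torch
import torch.nn as nn

class SharedModalityEncoder(nn.Module):          # hypothetical name
    def __init__(self, d_model=512, nhead=8, audio_layers=32, shared_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(audio_layers)]
        )
        self.shared_layers = shared_layers       # top layers reused for text

    def encode_audio(self, audio_feats):
        x = audio_feats                          # (batch, frames, d_model)
        for layer in self.layers:                # all 32 audio layers
            x = layer(x)
        return x

    def encode_text(self, text_embeds):
        x = text_embeds                          # (batch, tokens, d_model)
        for layer in self.layers[-self.shared_layers:]:  # top 12 layers only
            x = layer(x)
        return x

encoder = SharedModalityEncoder()
audio = torch.randn(2, 100, 512)                 # dummy audio features
text = torch.randn(2, 20, 512)                   # dummy text embeddings
print(encoder.encode_audio(audio).shape)         # torch.Size([2, 100, 512])
print(encoder.encode_text(text).shape)           # torch.Size([2, 20, 512])
```

Routing text through only the top layers is what makes both modalities end in the same parameters, which is the property the quoted passages rely on for transferring across tasks.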
“…In order for zero-shot to work, it is necessary that the model represents SRC audio and SRC text in a similar way so that it can leverage the ASR and MT tasks learnt during training to perform the ST task during inference. We use the Transformer architecture as described in [14] and [15], with the attention-based encoder and decoder. We extend it by including two parallel encoders, one for text input and one for audio input to fit our multi-modality training data.…”
Section: Zero-shot Speech Translation
confidence: 99%
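The two-parallel-encoder layout described in the quote above can likewise be sketched. Again this is a hedged illustration, not the implementation from [14] or [15]: the layer counts and the task-routing function run() are invented for the example.

```python
# Minimal sketch (hypothetical sizes and routing) of two parallel encoders,
# one per input modality, feeding a single shared decoder so that ASR, MT,
# and zero-shot ST all use the same model.
import torch
import torch.nn as nn

d_model, nhead = 512, 8
audio_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=6)
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=6)
shared_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=6)

def run(task, src, tgt):
    """Route SRC input to the encoder matching its modality, then decode with the shared decoder."""
    encoder = audio_encoder if task in ("asr", "st") else text_encoder
    memory = encoder(src)                 # encoded SRC audio or SRC text
    return shared_decoder(tgt, memory)    # identical decoder for every task

audio = torch.randn(2, 100, d_model)      # dummy SRC audio features
src_text = torch.randn(2, 20, d_model)    # dummy SRC text embeddings
tgt_text = torch.randn(2, 15, d_model)    # dummy target-side embeddings
for task, src in [("asr", audio), ("mt", src_text), ("st", audio)]:
    print(task, run(task, src, tgt_text).shape)
```

Because ASR and MT terminate in the same decoder (and, in the cited configuration, share encoder parameters as well), the ST direction from audio to TGT text can be decoded at inference time even though it was never trained directly.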
“…Our models use the Transformer architecture with attention-based encoder and decoder [14,15]. For single-task models, we closely follow the hyperparameter choices in [18].…”
Section: B. Model Configurations
confidence: 99%
“…Speech Translation (ST) systems are intended to generate text in the target language from audio in the source language. Conventional ST systems are cascades, comprising (in the most popular form) three blocks, i.e., an ASR system, a punctuation/segmentation module, and an MT model (Ngoc-Quan Pham, 2019; Pham et al., 2020b; Ansari et al., 2020). Both the Automatic Speech Recognition (ASR) and Machine Translation (MT) models are trained independently, and the MT model processes the ASR output text (the ASR hypotheses) to generate translations.…”
Section: Introduction
confidence: 99%
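For contrast with the end-to-end approach, the cascade pipeline described in this excerpt amounts to a composition of three independently trained components. The sketch below uses stand-in lambdas and a hypothetical function name purely for illustration.

```python
# Minimal sketch (stand-in components, hypothetical names) of the cascade ST
# pipeline: ASR hypotheses are punctuated/segmented, then translated by an
# independently trained MT model.
def cascade_speech_translation(audio, asr, punctuate, mt):
    hypotheses = asr(audio)              # source-language transcript
    segments = punctuate(hypotheses)     # restore punctuation / sentence breaks
    return [mt(segment) for segment in segments]

# Toy components standing in for real ASR / segmentation / MT systems:
print(cascade_speech_translation(
    audio=b"...",
    asr=lambda a: "hallo welt wie geht es dir",
    punctuate=lambda t: ["Hallo Welt.", "Wie geht es dir?"],
    mt=lambda s: {"Hallo Welt.": "Hello world.",
                  "Wie geht es dir?": "How are you?"}[s],
))   # ['Hello world.', 'How are you?']
```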