Interspeech 2020
DOI: 10.21437/interspeech.2020-2526

Relative Positional Encoding for Speech Recognition and Direct Translation

Cited by 24 publications (30 citation statements) · References: 0 publications
“…Modeling: The main architecture is the deep Transformer (Vaswani et al., 2017) with stochastic layers (Pham et al., 2019b). The encoder self-attention layer uses bidirectional relative attention (Pham et al., 2020a), which models the relative distance between one position and the other positions in the sequence. This modeling is bidirectional because the distance is distinguished for each direction from the perspective of a particular position.…”
Section: End-to-end Model (mentioning)
confidence: 99%
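
As a rough NumPy sketch of that idea only, and not the exact formulation in Pham et al. (2020a): the offset between a query position and each key position keeps its sign, so left and right context are looked up separately; the function name, clipping distance, and embedding size below are arbitrary choices for the example.

    import numpy as np

    def signed_relative_indices(seq_len, max_dist):
        # Offset j - i for every (query i, key j) pair: negative for keys to the left
        # of the query, positive for keys to the right, so direction is preserved.
        pos = np.arange(seq_len)
        dist = pos[None, :] - pos[:, None]
        # Clip so all offsets beyond max_dist share one embedding per direction, then
        # shift to non-negative indices into a (2 * max_dist + 1)-row embedding table.
        return np.clip(dist, -max_dist, max_dist) + max_dist

    indices = signed_relative_indices(seq_len=5, max_dist=3)   # shape (5, 5)
    rel_table = np.random.randn(2 * 3 + 1, 8)                  # one 8-dim embedding per clipped offset
    rel_inputs = rel_table[indices]                            # (5, 5, 8), fed into the attention scores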
“…Due to the non-sequential modeling of the original self-attention modules, the vanilla Transformer employs a position embedding given by a deterministic sinusoidal function to indicate the absolute position of each input element (Vaswani et al., 2017). However, this scheme is far from ideal for acoustic modeling (Pham et al., 2020). The latest work (Pham et al., 2020; Gulati et al., 2020) points out that relative position encoding enables the model to generalize better to unseen sequence lengths. It yields a significant improvement on acoustic modeling tasks.…”
Section: Relative Position Encoding (mentioning)
confidence: 99%
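
The generalization claim can be illustrated with a small, hypothetical check (the training length, test length, and clipping distance below are invented for the example): absolute position indices beyond the training length are new to the model, while the set of clipped relative offsets is the same at any sequence length.

    import numpy as np

    train_len, test_len, max_dist = 100, 400, 16

    # Absolute positions: indices 100..399 never occur when training on length-100 inputs.
    novel_positions = np.setdiff1d(np.arange(test_len), np.arange(train_len))
    print(novel_positions.size)                                # 300 unseen position indices

    # Clipped relative offsets: the set of offsets the model sees is identical at both lengths.
    def offset_set(n):
        dist = np.subtract.outer(np.arange(n), np.arange(n))   # i - j for all position pairs
        return np.unique(np.clip(dist, -max_dist, max_dist))

    print(np.array_equal(offset_set(train_len), offset_set(test_len)))   # True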
“…Relative positional encoding [12,13] is an extension of an absolute positional encoding technique that allows self-attention to handle relative positional information. The absolute positional encoding is defined as follows:…”
Section: Positional Encoding (mentioning)
confidence: 99%
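
The excerpt above is cut off before the definition it introduces; for reference, the sinusoidal absolute positional encoding of Vaswani et al. (2017) that the statement refers to is

    PE_{(pos,\,2i)}   = \sin\!\left( pos / 10000^{2i/d_{\text{model}}} \right)
    PE_{(pos,\,2i+1)} = \cos\!\left( pos / 10000^{2i/d_{\text{model}}} \right)

where pos is the absolute position in the sequence, i indexes the embedding dimension, and d_model is the model dimension, so each position receives a fixed, deterministic vector.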
“…To solve this problem, several approaches have been proposed. Masking [11] limits the range of self-attention by using a Gaussian window, whereas relative positional encoding [12, 13] uses relative embeddings in the self-attention architecture to eliminate the effect of the length mismatch. However, masking does not take into account the correlation between input features and relative distance.…”
Section: Introduction (mentioning)
confidence: 99%
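
For concreteness, here is a minimal single-head sketch of that mechanism (NumPy, no learned projections and no masking term), reusing the signed-offset indexing from the earlier sketch; it illustrates the general idea of adding a relative-position term to the attention scores and is not the exact formulation of [11]-[13].

    import numpy as np

    def self_attention_with_relative_bias(Q, K, V, rel_table, max_dist):
        # Q, K, V: (seq_len, d); rel_table: (2 * max_dist + 1, d) learned embeddings,
        # one vector per clipped signed offset between query and key positions.
        seq_len, d = Q.shape
        dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]   # key pos - query pos
        idx = np.clip(dist, -max_dist, max_dist) + max_dist                # (seq_len, seq_len)
        content_scores = Q @ K.T                                           # content-content term
        position_scores = np.einsum('id,ijd->ij', Q, rel_table[idx])       # content-position term
        scores = (content_scores + position_scores) / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)                     # row-wise softmax
        return weights @ V

    # The scores depend on positions only through the clipped offset, so the same
    # parameters apply unchanged to sequences longer than those seen in training.
    Q = K = V = np.random.randn(6, 8)
    out = self_attention_with_relative_bias(Q, K, V, np.random.randn(2 * 4 + 1, 8), max_dist=4)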