Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1459

Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks

Abstract: In this paper, we propose an addition-subtraction twin-gated recurrent network (ATR) to simplify neural machine translation. The recurrent units of ATR are heavily simplified to have the smallest number of weight matrices among units of all existing gated RNNs. With simple addition and subtraction operations, we introduce a twin-gated mechanism to build input and forget gates which are highly correlated. Despite this simplification, the essential non-linearities and capability of modeling long-distance dependencies…
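The abstract describes recurrent units that keep only two weight matrices and build the input and forget gates as the sum and difference of the same two projections. The sketch below is a minimal NumPy rendering of such a cell; the class name, initialization, and the final state update are illustrative assumptions based on this description, not the paper's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ATRCell:
    """Sketch of an addition-subtraction twin-gated recurrent (ATR) cell.

    Only two weight matrices are used (W for the input, U for the recurrent
    state), matching the abstract's claim of the smallest number of weight
    matrices among gated RNN units. The gate equations follow the twin-gated
    mechanism described in the abstract; see the paper for the exact cell.
    """

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(hidden_size, input_size))
        self.U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

    def step(self, x_t, h_prev):
        p_t = self.W @ x_t           # projected input
        q_t = self.U @ h_prev        # projected history
        i_t = sigmoid(p_t + q_t)     # input gate: addition of the twins
        f_t = sigmoid(p_t - q_t)     # forget gate: subtraction of the twins
        return i_t * p_t + f_t * h_prev
```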

Cited by 13 publications (10 citation statements)
References 29 publications

“…In this section, we explain the datasets, model architectures, optimization details and evaluation metrics used in our experiments. All implementations are based on the zero toolkit (Zhang et al., 2018; https://github.com/bzhangGo/zero). Regarding audio preprocessing, we use the given audio segmentation (train/dev/test) for experiments. We extract 40-dimensional log-Mel filterbanks with a step size of 10ms and window size of 25ms as the acoustic features, followed by feature expansion via second-order derivatives and mean-variance normalization.…”
Section: Experimental Settings
confidence: 99%
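The snippet above specifies the acoustic front-end: 40-dimensional log-Mel filterbanks over 25 ms windows with a 10 ms step, expanded with derivative features and mean-variance normalized. A hedged librosa sketch of that pipeline is given below; it is not the zero toolkit's implementation, and the 16 kHz sampling rate, the log floor, and the inclusion of first-order deltas alongside the second-order ones are assumptions.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mels=40):
    """40-dim log-Mel filterbanks (25 ms window, 10 ms step) with delta
    expansion and per-utterance mean-variance normalization."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels,
        win_length=int(0.025 * sr),   # 25 ms analysis window
        hop_length=int(0.010 * sr),   # 10 ms step
    )
    logmel = np.log(mel + 1e-10)                     # log compression
    delta = librosa.feature.delta(logmel, order=1)   # first-order derivatives (assumed)
    delta2 = librosa.feature.delta(logmel, order=2)  # second-order derivatives
    feats = np.concatenate([logmel, delta, delta2], axis=0)
    mean = feats.mean(axis=1, keepdims=True)
    std = feats.std(axis=1, keepdims=True) + 1e-10
    return ((feats - mean) / std).T                  # frames x features
```
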
“…We also compare our approach with the Averaged Attention Network (AAN) decoder (Zhang et al., 2018a), LN-LSTM and the Addition-subtraction Twin-gated Recurrent (ATR) network (Zhang et al., 2018b) on the WMT14 En-De task.…”
Section: Results
confidence: 99%
“…LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) are the most popular recurrent models. To accelerate RNN models, Zhang et al. (2018b) propose a heavily simplified ATR network with the smallest number of weight matrices among units of all existing gated RNNs. Peter et al. (2016) investigate exponentially decaying bag-of-words input features for feedforward NMT models.…”
Section: Related Work
confidence: 99%
“…Our goal is to design a more concise deep learning model that preserves accuracy while being easier to deploy on resource-limited IoT micro-controllers, which are too small to store more complex recurrent network models. Taking the standard RNN, LSTM, and GRU models as the benchmark, and drawing inspiration from the addition-subtraction twin-gated recurrent (ATR) cell proposed in the literature [21], we propose a more concise and smaller gated recurrent cell for PQD detection.…”
Section: The Proposed SGRN Method
confidence: 99%
“…In the design process, we draw inspiration from the ATR structure proposed in [21] and add a mechanism similar to self-attention to the proposed recurrent cell. We analyze SGRN by decomposing its recurrent structure, which can be explained by expanding Equation (4).…”
Section: Analysis of SGRN Structure
confidence: 99%
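The analysis described here unrolls the recurrence. Since Equation (4) of the SGRN paper is not reproduced in the snippet, the expansion below is sketched for an ATR-style update h_t = i_t ⊙ p_t + f_t ⊙ h_{t-1} instead, showing how the hidden state becomes a forget-gate-weighted sum of gated inputs.

```latex
% Illustrative expansion of an ATR-style recurrence (a stand-in for SGRN's
% Equation (4), which is not shown above); h_0 is assumed to be zero.
\begin{aligned}
h_t &= i_t \odot p_t + f_t \odot h_{t-1} \\
    &= i_t \odot p_t + f_t \odot \left( i_{t-1} \odot p_{t-1} + f_{t-1} \odot h_{t-2} \right) \\
    &= \sum_{k=1}^{t} \Bigl( \prod_{j=k+1}^{t} f_j \Bigr) \odot i_k \odot p_k .
\end{aligned}
```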