Proceedings of the Second Workshop on Insights From Negative Results in NLP 2021
DOI: 10.18653/v1/2021.insights-1.10

Recurrent Attention for the Transformer

Abstract: In this work, we conduct a comprehensive investigation of one of the centerpieces of modern machine translation systems: the encoder-decoder attention mechanism. Motivated by the concept of first-order alignments, we extend the (cross-)attention mechanism by a recurrent connection, allowing direct access to previous attention/alignment decisions. We propose several ways to include such a recurrency into the attention mechanism. Verifying their performance across different translation tasks we conclude that thes…
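The extension described in the abstract can be pictured with a small, single-head sketch: the cross-attention logits at target step i additionally see the attention distribution produced at step i-1. The additive-bias formulation, the layer names, and the zero-initialised previous alignment below are assumptions made purely for illustration; the paper proposes several variants and this is not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentCrossAttention(nn.Module):
    """Single-head cross-attention whose logits at target step i also see the
    attention distribution produced at step i-1 (a first-order dependency).
    Hypothetical sketch, not the formulation from the paper."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Maps the previous alignment (one scalar per source position) to an
        # additive bias on the current logits -- an assumed, simple realisation.
        self.prev_attn_proj = nn.Linear(1, 1)
        self.scale = d_model ** -0.5

    def forward(self, tgt: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # tgt: (T, d_model) decoder states, src: (S, d_model) encoder states
        q, k, v = self.q_proj(tgt), self.k_proj(src), self.v_proj(src)
        prev_attn = torch.zeros(src.size(0), device=src.device)  # alignment at step i-1
        outputs = []
        for i in range(tgt.size(0)):  # sequential over target positions
            logits = (q[i] @ k.T) * self.scale                               # (S,)
            logits = logits + self.prev_attn_proj(prev_attn.unsqueeze(-1)).squeeze(-1)
            attn = F.softmax(logits, dim=-1)                                 # (S,)
            outputs.append(attn @ v)                                         # (d_model,)
            prev_attn = attn
        return torch.stack(outputs)  # (T, d_model)


# Toy usage: 5 target positions attending over 7 source positions.
layer = RecurrentCrossAttention(d_model=16)
out = layer(torch.randn(5, 16), torch.randn(7, 16))
print(out.shape)  # torch.Size([5, 16])
```

Note that the explicit loop over target positions is what distinguishes this from standard cross-attention, where all target positions are processed in one parallel pass during training.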

Cited by 4 publications (2 citation statements)
References 7 publications
“…Ref. [23] extends the (cross-)attention mechanism by a recurrent connection to allow direct access to previous alignment decisions, which incorporates several structural biases to improve the attention-based model by involving Markov conditions, fertility, and consistency in the direction of translation [24,25]. Refs.…”
Section: Related Work
confidence: 99%
“…There is a work [25] that tried to introduce the coverage mechanism into the Transformer decoder. They directly applied the RNN coverage mechanism to the Transformer, which greatly hurts its parallelism and training efficiency.…”
Section: Coverage Mechanism
confidence: 99%
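The efficiency cost mentioned in this citation statement comes from a cyclic dependency: attention at step i needs the coverage accumulated over all earlier steps, and coverage in turn needs those earlier attentions. A minimal illustration of that dependency follows; the additive-penalty form, the weighting, and the function name are hypothetical and only show why the loop cannot be parallelised.

```python
import torch
import torch.nn.functional as F


def coverage_decode(scores: torch.Tensor, cov_weight: float = 1.0) -> torch.Tensor:
    """scores: (T, S) raw cross-attention logits for T target and S source positions.
    Attention at step i is penalised by the coverage accumulated over steps < i,
    so the loop below cannot be replaced by one parallel pass over target positions."""
    T, S = scores.shape
    coverage = torch.zeros(S)
    attentions = []
    for i in range(T):                                    # inherently sequential
        attn = F.softmax(scores[i] - cov_weight * coverage, dim=-1)
        attentions.append(attn)
        coverage = coverage + attn                        # c_i = c_{i-1} + a_i
    return torch.stack(attentions)                        # (T, S)
```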