Proceedings of the Second Workshop on Insights From Negative Results in NLP 2021
DOI: 10.18653/v1/2021.insights-1.10

Recurrent Attention for the Transformer

Abstract: In this work, we conduct a comprehensive investigation of one of the centerpieces of modern machine translation systems: the encoder-decoder attention mechanism. Motivated by the concept of first-order alignments, we extend the (cross-)attention mechanism by a recurrent connection, allowing direct access to previous attention/alignment decisions. We propose several ways to include such a recurrency into the attention mechanism. Verifying their performance across different translation tasks we conclude that thes…
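The extension described in the abstract can be pictured with a small, single-head sketch: the cross-attention logits at target step i additionally see the attention distribution produced at step i-1. The additive-bias formulation, the layer names, and the zero-initialised previous alignment below are assumptions made purely for illustration; the paper proposes several variants and this is not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentCrossAttention(nn.Module):
    """Single-head cross-attention whose logits at target step i also see the
    attention distribution produced at step i-1 (a first-order dependency).
    Hypothetical sketch, not the formulation from the paper."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Maps the previous alignment (one scalar per source position) to an
        # additive bias on the current logits -- an assumed, simple realisation.
        self.prev_attn_proj = nn.Linear(1, 1)
        self.scale = d_model ** -0.5

    def forward(self, tgt: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
        # tgt: (T, d_model) decoder states, src: (S, d_model) encoder states
        q, k, v = self.q_proj(tgt), self.k_proj(src), self.v_proj(src)
        prev_attn = torch.zeros(src.size(0), device=src.device)  # alignment at step i-1
        outputs = []
        for i in range(tgt.size(0)):  # sequential over target positions
            logits = (q[i] @ k.T) * self.scale                               # (S,)
            logits = logits + self.prev_attn_proj(prev_attn.unsqueeze(-1)).squeeze(-1)
            attn = F.softmax(logits, dim=-1)                                 # (S,)
            outputs.append(attn @ v)                                         # (d_model,)
            prev_attn = attn
        return torch.stack(outputs)  # (T, d_model)


# Toy usage: 5 target positions attending over 7 source positions.
layer = RecurrentCrossAttention(d_model=16)
out = layer(torch.randn(5, 16), torch.randn(7, 16))
print(out.shape)  # torch.Size([5, 16])
```

Note that the explicit loop over target positions is what distinguishes this from standard cross-attention, where all target positions are processed in one parallel pass during training.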

Cited by 4 publications (2 citation statements)
References 7 publications
“…Ref. [23] extends the (cross-)attention mechanism by a recurrent connection to allow direct access to previous alignment decisions, which incorporates several structural biases to improve the attention-based model by involving Markov conditions, fertility, and consistency in the direction of translation [24,25]. Refs.…”
Section: Related Work
confidence: 99%
“…There is a work [25] that tried to introduce the coverage mechanism into the Transformer decoder. They directly applied the RNN coverage mechanism to the Transformer, which greatly hurts its parallelism and training efficiency.…”
Section: Coverage Mechanism
confidence: 99%
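The efficiency cost mentioned in this citation statement comes from a cyclic dependency: attention at step i needs the coverage accumulated over all earlier steps, and coverage in turn needs those earlier attentions. A minimal illustration of that dependency follows; the additive-penalty form, the weighting, and the function name are hypothetical and only show why the loop cannot be parallelised.

```python
import torch
import torch.nn.functional as F


def coverage_decode(scores: torch.Tensor, cov_weight: float = 1.0) -> torch.Tensor:
    """scores: (T, S) raw cross-attention logits for T target and S source positions.
    Attention at step i is penalised by the coverage accumulated over steps < i,
    so the loop below cannot be replaced by one parallel pass over target positions."""
    T, S = scores.shape
    coverage = torch.zeros(S)
    attentions = []
    for i in range(T):                                    # inherently sequential
        attn = F.softmax(scores[i] - cov_weight * coverage, dim=-1)
        attentions.append(attn)
        coverage = coverage + attn                        # c_i = c_{i-1} + a_i
    return torch.stack(attentions)                        # (T, S)
```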