“…Attention mechanisms. As argued in [76], spatial deformation modeling methods [28,37,10,48], including VTNs, can be viewed as hard attention mechanisms, in that they localize and attend to the discriminative image parts. Attention mechanisms in neural networks have quickly gained popularity in diverse computer vision and natural language processing tasks, such as relational reasoning among objects [4,52], image captioning [67], neural machine translation [3,61], image generation [68,71], and image recognition [23,63].…”