2021
DOI: 10.48550/arxiv.2111.14576
Preprint

Recurrent Vision Transformer for Solving Visual Reasoning Problems

Abstract: Although convolutional neural networks (CNNs) have shown remarkable results in many vision tasks, they are still strained by simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer vision, in this paper we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT…
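The abstract credits two ingredients, recurrent connections and spatial attention, for the model's reasoning ability. As a rough illustration of how these can be combined, the following is a minimal PyTorch sketch in which a single shared transformer encoder layer (spatial self-attention over patch tokens) is applied repeatedly, so depth is replaced by recurrence. All names and hyperparameters here (RecurrentViTSketch, dim=256, steps=4, etc.) are illustrative assumptions, not the paper's actual RViT configuration.

```python
# Minimal sketch of a recurrent vision-transformer block, assuming the
# core idea is weight sharing across time steps: the SAME encoder layer
# is iterated on the patch tokens instead of stacking distinct layers.
# This illustrates the general idea, not the authors' exact architecture.
import torch
import torch.nn as nn

class RecurrentViTSketch(nn.Module):
    def __init__(self, image_size=128, patch_size=16, dim=256,
                 heads=8, steps=4, num_classes=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: non-overlapping patches -> token vectors.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # One shared encoder layer, reused at every recurrent step.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.steps = steps
        self.head = nn.Linear(dim, num_classes)  # e.g. same vs. different

    def forward(self, x):
        b = x.size(0)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        # Recurrence: iterate the same spatial-attention layer.
        for _ in range(self.steps):
            tokens = self.shared_layer(tokens)
        return self.head(tokens[:, 0])  # classify from the [CLS] token

model = RecurrentViTSketch()
logits = model(torch.randn(2, 3, 128, 128))
print(logits.shape)  # torch.Size([2, 2])
```

Because the layer is shared, the parameter count stays constant no matter how many recurrent steps are run, which is one plausible reason recurrence suits abstract reasoning tasks where more iterations can refine a comparison without adding capacity.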

Cited by 1 publication (1 citation statement)
References 27 publications
“…As an initial experiment, we attempted to train and test a Vision Transformer (ViT) (Dosovitskiy et al., 2020) constrained to have a similar number of parameters (21 million) to the ResNet-50 used here. We were not able to get these architectures to do well on most of the tasks that are difficult for ResNets, even with 100,000 samples (also shown in Messina, Amato, Carrara, Gennaro, & Falchi, 2021a). It is worth noting that even 100,000 samples remains a relatively small data set size by modern standards, since ViT was trained from scratch.…”
Section: Discussion
Citation type: mentioning
Confidence: 93%
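The quoted experiment matched the ViT's parameter budget (about 21 million) to the ResNet-50 baseline. A quick way to verify such a match is to count parameters directly. The sketch below assumes ViT-Small (roughly 22M parameters), the closest standard variant to the quoted figure; the exact configuration used in the citing study is not stated, so the model names here are assumptions.

```python
# Hedged sketch: comparing parameter budgets as in the quoted experiment
# (a ViT sized to roughly match ResNet-50). 'vit_small_patch16_224' is an
# assumed stand-in for the unspecified 21M-parameter ViT.
import timm                          # pip install timm
import torchvision.models as tvm

def n_params(model):
    """Total number of trainable and non-trainable parameters."""
    return sum(p.numel() for p in model.parameters())

vit = timm.create_model('vit_small_patch16_224', pretrained=False,
                        num_classes=2)            # trained from scratch
resnet = tvm.resnet50(weights=None, num_classes=2)

print(f"ViT-Small : {n_params(vit) / 1e6:.1f}M parameters")
print(f"ResNet-50 : {n_params(resnet) / 1e6:.1f}M parameters")
```

Matching parameter counts in this way controls for raw capacity, so any remaining performance gap on the reasoning tasks can be attributed to architectural inductive biases rather than model size.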