2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.01378

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Cited by 99 publications (108 citation statements: 0 supporting, 81 mentioning, 0 contrasting)
References 21 publications

Citation statements, ordered by relevance:
“…Following ViT, many transformer-based architectures such as PCT [27], IPT [79], T2T-ViT [44], DeepViT [167], SETR [81], PVT [45], CaiT [168], TNT [82], Swin-transformer [46], Query2Label [83], MoCoV3 [84], BEiT [85], SegFormer [86], FuseFormer [169], and MAE [170] have appeared, with excellent results for many kinds of visual tasks including image classification, object detection, semantic segmentation, point cloud processing, action recognition, and self-supervised learning.…”
Section: Vision Transformers (mentioning, confidence: 99%)
“…All these works have consistently found promising generalization capabilities of Transformer architectures. Nevertheless, VTs remain largely unexplored in this regard: only a few works have tested their models on OOD data [13], [62], [68], [69], [115], [126] or evaluated the learned features in other settings [50], [52], [71], [88]. While we expect them to follow the same trend as other modalities, further research is needed.…”
Section: The Road Ahead (mentioning, confidence: 99%)
“…Minimal embeddings. Inspired by the success of ViT [7], a few video methods omit large backbones and instead use linear projections or convolutions to embed tokens representing small portions of the input video [7], [9], [11], [88], [115], [130]. Empirical studies such as [9], [130] show that stand-alone Transformers (i.e., without complex CNN backbones) are as performant as their CNN counterparts, albeit at the cost of high computational and data requirements.…”
Section: Embedding (mentioning, confidence: 99%)
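
For readers unfamiliar with this style of embedding, the sketch below illustrates the idea the excerpt describes: tokens are produced by a single strided convolution, which is equivalent to a linear projection of each flattened, non-overlapping patch, rather than by a large CNN backbone. This is a minimal PyTorch illustration; the class name, patch size, and embedding dimension are illustrative assumptions, not taken from FuseFormer or any of the cited works.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Tube embedding: split a video into non-overlapping spatio-temporal
    patches and linearly project each patch to one token. Patch size and
    embedding dimension here are illustrative assumptions."""

    def __init__(self, patch=(2, 16, 16), in_ch=3, embed_dim=512):
        super().__init__()
        # A Conv3d whose kernel equals its stride applies exactly one
        # linear projection to each flattened, non-overlapping patch.
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, num_tokens, D)

# A 16-frame 224x224 RGB clip yields 8 * 14 * 14 = 1568 tokens of dim 512.
tokens = VideoPatchEmbed()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)  # torch.Size([1, 1568, 512])
```

The entire embedding stage is a single learned projection, which is what lets these models drop the "complex CNN backbone" the excerpt refers to.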