2020
DOI: 10.48550/arxiv.2011.10185
Preprint

ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Abstract: Figure 1. Example of video frame extrapolation. Top: the extrapolated result; middle: zoomed-in local details; bottom: the occlusion map computed against the ground truth.

Cited by 21 publications (34 citation statements)
References 42 publications

“…Combination of CNN and Transformer. ConvTransformer [52] mapped the input sequence to a sequence of feature maps with an encoder built on multi-head convolutional self-attention layers, and then decoded the target synthesized frame from that feature-map sequence with another deep network containing multi-head convolutional self-attention layers. Conformer [53] relied on a Feature Coupling Unit (FCU) to interactively fuse local and global feature representations at different resolutions.…”
Section: B. Transformer in Vision
Citation type: mentioning (confidence: 99%)
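
The excerpt above describes the ConvTransformer encoder only at a high level. The sketch below shows one plausible reading of a multi-head convolutional self-attention layer over a sequence of frame feature maps: convolutions replace the usual linear Q/K/V projections, and attention is computed across the time axis independently at each spatial location. Kernel sizes, the head split, and all class and argument names are illustrative assumptions, not the authors' exact layer.

```python
# Minimal sketch of multi-head convolutional self-attention over a sequence
# of frame feature maps. NOT the authors' exact layer: kernel sizes, head
# splitting, and the softmax over the time axis are illustrative assumptions.
import torch
import torch.nn as nn

class ConvMultiHeadSelfAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        # Convolutions (not linear layers) produce queries, keys, and values,
        # so spatial locality is preserved in the projections.
        self.to_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_k = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_v = nn.Conv2d(channels, channels, 3, padding=1)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) - a sequence of feature maps
        b, t, c, h, w = x.shape
        d = c // self.heads
        flat = x.reshape(b * t, c, h, w)
        q = self.to_q(flat).reshape(b, t, self.heads, d, h, w)
        k = self.to_k(flat).reshape(b, t, self.heads, d, h, w)
        v = self.to_v(flat).reshape(b, t, self.heads, d, h, w)
        # Attention across the time axis, computed independently per spatial
        # location: score[i, j] compares frame i's query with frame j's key.
        scores = torch.einsum('bihdyx,bjhdyx->bhijyx', q, k) / d ** 0.5
        attn = scores.softmax(dim=3)  # normalize over source frames j
        out = torch.einsum('bhijyx,bjhdyx->bihdyx', attn, v)
        out = out.reshape(b * t, c, h, w)
        return self.out(out).reshape(b, t, c, h, w)
```

For example, `ConvMultiHeadSelfAttention(64)` applied to a `(1, 5, 64, 32, 32)` tensor (a five-frame sequence of 64-channel feature maps) returns a tensor of the same shape, so the layer can be stacked like a standard Transformer block.
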
“…Transformer [45] is an encoder-decoder neural network for sequence-to-sequence tasks, which has achieved many state-of-the-art results and further revolutionized NLP with the success of BERT [10]. The recently popular vision Transformer has shown that an end-to-end standard Transformer can handle image classification and other vision tasks [4,30,24,54,56]. ViT [11] cuts an image into non-overlapping patches and encodes the patch set as a token sequence, to whose head a learnable classification token is attached.…”
Section: Transformer
Citation type: mentioning (confidence: 99%)
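
Since the excerpt leans on ViT's tokenization, here is a minimal sketch of that step: the image is cut into non-overlapping patches via a strided convolution (equivalent to a per-patch linear projection), and a learnable classification token is attached to the head of the resulting sequence. The default sizes mirror the common ViT-Base configuration; treat them and the names as assumptions.

```python
# Minimal sketch of ViT-style tokenization: cut the image into
# non-overlapping patches, embed each patch, and prepend a learnable
# classification token. Sizes follow the common ViT-Base setup.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        # A strided convolution implements "cut into non-overlapping
        # patches, then linearly project each patch" in one step.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        tokens = self.proj(images)                  # (batch, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)    # classification token at the head
        return tokens + self.pos_embed              # token sequence for the encoder
```

A `(2, 3, 224, 224)` batch yields a `(2, 197, 768)` token sequence; the Transformer encoder consumes it, and the classifier reads out the first (classification) token.
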
“…Introducing Convolution to Transformers. Convolutions have been used to modify the Transformer block in NLP and 2D image recognition, either by replacing multi-head attention with convolution [48] or by adding convolution layers to capture local correlations [52,26,49]. Different from all previous works, we apply a convolution operation (i.e., EdgeConv [46]) solely to the query features to summarize local responses from unordered 3D points and generate global geometric representations, a purpose opposite to that of [26,49].…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
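
To make the quoted idea concrete, the sketch below applies EdgeConv solely to the attention queries: for each 3D point, its k nearest neighbors are gathered, edge features [x_i, x_j - x_i] are passed through a small MLP, and a max over the neighborhood summarizes local structure; keys and values would be left untouched. The k-NN criterion, MLP widths, and all names here are illustrative assumptions rather than the cited paper's exact design.

```python
# Minimal sketch of EdgeConv applied only to per-point query features.
# Follows the original DGCNN edge-feature recipe; shapes and names are
# illustrative assumptions, not the cited paper's exact design.
import torch
import torch.nn as nn

def knn_indices(points: torch.Tensor, k: int) -> torch.Tensor:
    # points: (batch, N, dim); returns indices of the k nearest neighbors
    # (self is included among them, which is fine for a sketch)
    dists = torch.cdist(points, points)            # (batch, N, N)
    return dists.topk(k, largest=False).indices    # (batch, N, k)

class EdgeConvOnQueries(nn.Module):
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        # Edge MLP over [x_i, x_j - x_i], the standard EdgeConv edge feature.
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, queries: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # queries: (batch, N, dim) attention queries; coords: (batch, N, 3) xyz
        b, n, d = queries.shape
        idx = knn_indices(coords, self.k)                        # (b, N, k)
        neighbors = torch.gather(
            queries.unsqueeze(1).expand(b, n, n, d), 2,
            idx.unsqueeze(-1).expand(b, n, self.k, d))           # (b, N, k, d)
        center = queries.unsqueeze(2).expand(b, n, self.k, d)
        edges = torch.cat([center, neighbors - center], dim=-1)  # (b, N, k, 2d)
        # Max over the neighborhood summarizes local geometric structure.
        return self.mlp(edges).max(dim=2).values                 # (b, N, d)
```

The summarized queries then attend to unmodified keys and values, which is what lets the convolution inject local geometry while the attention itself stays global.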