2020
DOI: 10.48550/arxiv.2011.10185
Preprint

ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis

Abstract: Figure 1. Example of video frame extrapolation. Top: the extrapolated result; middle: zoomed-in local details; bottom: the occlusion map computed against the ground truth.

Cited by 21 publications (34 citation statements)
References 42 publications

“…Combination of CNN and Transformer. ConvTransformer [52] mapped the input sequence to a sequence of feature maps with an encoder built on multi-head convolutional self-attention layers, and then decoded the target synthesized frame from that feature-map sequence with another deep network containing multi-head convolutional self-attention layers. Conformer [53] relied on a Feature Coupling Unit (FCU) to interactively fuse local and global feature representations at different resolutions.…”
Section: B. Transformer in Vision
Citation type: mentioning (confidence: 99%)
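
The excerpt above describes the ConvTransformer encoder only at a high level. The sketch below shows one plausible reading of a multi-head convolutional self-attention layer over a sequence of frame feature maps: convolutions replace the usual linear Q/K/V projections, and attention is computed across the time axis independently at each spatial location. Kernel sizes, the head split, and all class and argument names are illustrative assumptions, not the authors' exact layer.

```python
# Minimal sketch of multi-head convolutional self-attention over a sequence
# of frame feature maps. NOT the authors' exact layer: kernel sizes, head
# splitting, and the softmax over the time axis are illustrative assumptions.
import torch
import torch.nn as nn

class ConvMultiHeadSelfAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        # Convolutions (not linear layers) produce queries, keys, and values,
        # so spatial locality is preserved in the projections.
        self.to_q = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_k = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_v = nn.Conv2d(channels, channels, 3, padding=1)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width) - a sequence of feature maps
        b, t, c, h, w = x.shape
        d = c // self.heads
        flat = x.reshape(b * t, c, h, w)
        q = self.to_q(flat).reshape(b, t, self.heads, d, h, w)
        k = self.to_k(flat).reshape(b, t, self.heads, d, h, w)
        v = self.to_v(flat).reshape(b, t, self.heads, d, h, w)
        # Attention across the time axis, computed independently per spatial
        # location: score[i, j] compares frame i's query with frame j's key.
        scores = torch.einsum('bihdyx,bjhdyx->bhijyx', q, k) / d ** 0.5
        attn = scores.softmax(dim=3)  # normalize over source frames j
        out = torch.einsum('bhijyx,bjhdyx->bihdyx', attn, v)
        out = out.reshape(b * t, c, h, w)
        return self.out(out).reshape(b, t, c, h, w)
```

For example, `ConvMultiHeadSelfAttention(64)` applied to a `(1, 5, 64, 32, 32)` tensor (a five-frame sequence of 64-channel feature maps) returns a tensor of the same shape, so the layer can be stacked like a standard Transformer block.
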
“…Transformer [45] is an encoder-decoder neural network for sequence-to-sequence tasks, which has achieved many state-of-the-art results and further revolutionized NLP with the success of BERT [10]. The recently popular vision Transformer has shown that an end-to-end standard Transformer can handle image classification and other vision tasks [4,30,24,54,56]. ViT [11] cuts an image into non-overlapping patches and encodes the patch set as a token sequence, to whose head a learnable classification token is attached.…”
Section: Transformer
Citation type: mentioning (confidence: 99%)
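
Since the excerpt leans on ViT's tokenization, here is a minimal sketch of that step: the image is cut into non-overlapping patches via a strided convolution (equivalent to a per-patch linear projection), and a learnable classification token is attached to the head of the resulting sequence. The default sizes mirror the common ViT-Base configuration; treat them and the names as assumptions.

```python
# Minimal sketch of ViT-style tokenization: cut the image into
# non-overlapping patches, embed each patch, and prepend a learnable
# classification token. Sizes follow the common ViT-Base setup.
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        # A strided convolution implements "cut into non-overlapping
        # patches, then linearly project each patch" in one step.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width)
        tokens = self.proj(images)                  # (batch, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)    # classification token at the head
        return tokens + self.pos_embed              # token sequence for the encoder
```

A `(2, 3, 224, 224)` batch yields a `(2, 197, 768)` token sequence; the Transformer encoder consumes it, and the classifier reads out the first (classification) token.
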
“…Introducing Convolution to Transformers. Convolutions have been used to modify the Transformer block in NLP and 2D image recognition, either by replacing multi-head attention with convolution [48] or by adding convolution layers to capture local correlations [52,26,49]. Different from all previous works, we apply a convolution operation (i.e., EdgeConv [46]) solely to the query features to summarize local responses from unordered 3D points and generate global geometric representations, a purpose opposite to that of [26,49].…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
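
To make the quoted idea concrete, the sketch below applies EdgeConv solely to the attention queries: for each 3D point, its k nearest neighbors are gathered, edge features [x_i, x_j - x_i] are passed through a small MLP, and a max over the neighborhood summarizes local structure; keys and values would be left untouched. The k-NN criterion, MLP widths, and all names here are illustrative assumptions rather than the cited paper's exact design.

```python
# Minimal sketch of EdgeConv applied only to per-point query features.
# Follows the original DGCNN edge-feature recipe; shapes and names are
# illustrative assumptions, not the cited paper's exact design.
import torch
import torch.nn as nn

def knn_indices(points: torch.Tensor, k: int) -> torch.Tensor:
    # points: (batch, N, dim); returns indices of the k nearest neighbors
    # (self is included among them, which is fine for a sketch)
    dists = torch.cdist(points, points)            # (batch, N, N)
    return dists.topk(k, largest=False).indices    # (batch, N, k)

class EdgeConvOnQueries(nn.Module):
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        # Edge MLP over [x_i, x_j - x_i], the standard EdgeConv edge feature.
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, queries: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # queries: (batch, N, dim) attention queries; coords: (batch, N, 3) xyz
        b, n, d = queries.shape
        idx = knn_indices(coords, self.k)                        # (b, N, k)
        neighbors = torch.gather(
            queries.unsqueeze(1).expand(b, n, n, d), 2,
            idx.unsqueeze(-1).expand(b, n, self.k, d))           # (b, N, k, d)
        center = queries.unsqueeze(2).expand(b, n, self.k, d)
        edges = torch.cat([center, neighbors - center], dim=-1)  # (b, N, k, 2d)
        # Max over the neighborhood summarizes local geometric structure.
        return self.mlp(edges).max(dim=2).values                 # (b, N, d)
```

The summarized queries then attend to unmodified keys and values, which is what lets the convolution inject local geometry while the attention itself stays global.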