2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00863
End-to-End Video Instance Segmentation with Transformers

Cited by 469 publications (265 citation statements). References 17 publications.
“…In Transformer-based models, a large number of works use CNNs as either the encoder [29], [30], [31], [32], [33] or decoder [34], [35], [36], [37], [38] to capture fine details.…”
Section: Spatial Domain Learning
Mentioning confidence: 99%
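To make the hybrid design in this statement concrete, here is a minimal sketch (PyTorch assumed; the model name, layer sizes, and hyperparameters are illustrative and not taken from any of the cited works) of the "CNN as encoder" pattern: a small convolutional stem captures local detail, and a standard Transformer encoder then models global relations over the flattened feature tokens.

```python
# A minimal CNN-encoder + Transformer sketch (illustrative only).
import torch
import torch.nn as nn

class ConvTransformerHybrid(nn.Module):
    def __init__(self, dim=256, nhead=8, num_layers=4):
        super().__init__()
        # CNN encoder: captures local fine detail before global attention.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, kernel_size=3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.stem(x)                        # (B, dim, H/8, W/8)
        tokens = f.flatten(2).transpose(1, 2)   # (B, HW/64, dim)
        return self.encoder(tokens)             # globally refined tokens

out = ConvTransformerHybrid()(torch.randn(2, 3, 224, 224))  # -> (2, 784, 256)
```

The "CNN as decoder" variants cited in the same statement invert this ordering, using convolutions on the Transformer's output to recover fine spatial detail.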
“…Recently, pioneering works such as ViT [50] and DETR [76] proposed to use transformers to solve vision problems by representing images as sequences of patches. Transformers have been shown to be effective in tasks such as image classification [51,77], object detection [76], semantic/instance segmentation [45], and video segmentation [78]. Specifically, ViT [50] proposed to cut the image into patches, which are then converted to sequences of features and used as inputs to standard transformers.…”
Section: Vision Transformer
Mentioning confidence: 99%
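The patch pipeline this statement describes can be sketched as follows (PyTorch assumed; TinyViT and all hyperparameters are hypothetical stand-ins, not ViT's actual configuration): a stride-p convolution cuts the image into non-overlapping p×p patches and linearly embeds each one, a class token and position embeddings are added, and the sequence passes through a standard Transformer encoder.

```python
# A minimal ViT-style patch-embedding sketch (illustrative only).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, nhead=3, depth=4,
                 num_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A stride-`patch` conv is equivalent to cutting non-overlapping
        # patches and applying a shared linear projection to each.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, 3, 224, 224)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        t = torch.cat([cls, t], dim=1) + self.pos_embed
        t = self.encoder(t)
        return self.head(t[:, 0])                # classify from [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> (2, 1000)
```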
“…The Transformer [47] is an effective sequence-to-sequence modeling network, and it has achieved state-of-the-art results in NLP tasks following the success of BERT [15]. Owing to this success, it has also been exploited in the computer vision community, where 'CNN + Transformer' has become a popular paradigm [3,49,7,62,31,32,21]. ViT [16] leads another trend of using pure transformers for vision tasks [23,30,54] by dividing images into patch-embedding sequences and feeding them into standard transformers.…”
Section: Transformer for Vision
Mentioning confidence: 99%
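As a rough illustration of the 'CNN + Transformer' paradigm mentioned here, the following DETR-flavored sketch (PyTorch assumed; TinyDETR and its dimensions are hypothetical and heavily simplified relative to the actual DETR of [76]) runs CNN features through a Transformer encoder-decoder and decodes a fixed set of learned queries into per-object class and box predictions.

```python
# A minimal 'CNN + Transformer' detection sketch (DETR-like, simplified).
import torch
import torch.nn as nn

class TinyDETR(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_classes=91):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in CNN backbone
            nn.Conv2d(3, dim, kernel_size=8, stride=8), nn.ReLU(inplace=True))
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1: "no object"
        self.box_head = nn.Linear(dim, 4)                  # (cx, cy, w, h)

    def forward(self, x):                         # x: (B, 3, H, W)
        f = self.backbone(x).flatten(2).transpose(1, 2)    # (B, HW/64, dim)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.transformer(src=f, tgt=q)        # (B, num_queries, dim)
        return self.class_head(h), self.box_head(h).sigmoid()

cls, box = TinyDETR()(torch.randn(2, 3, 224, 224))  # (2,100,92), (2,100,4)
```

The paper indexed on this page, VisTR, extends this query-based design from single images to video clips, decoding instance-level mask sequences end to end.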