“…A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], and segmentation [55,51,16,2]. Although vision Transformers have shown their superiority in modeling long-range dependencies [13,43], many works still demonstrate that convolution can help Transformers achieve better visual representations [56,58,61,60,25]. Owing to their impressive performance, Transformers have also been introduced for low-level vision tasks [5,54,37,29,3,62,28,26].…”