“…The Transformer was first proposed in [63] for machine translation. Recently, Transformers have achieved great success in high-level vision tasks such as image classification [1,16,17,44,69], semantic segmentation [7,44,69,80], human pose estimation [5,6,39,41,46,70], and object detection [9,14,30,44,82]. Because they capture long-range dependencies and perform well on many high-level vision tasks, Transformers have also been introduced into low-level vision [8,10,42,67].…”