“…Vision Transformers treat input pixels as tokens and use self-attention operations to model the interactions between these tokens. Inspired by the success of Vision Transformers, many attempts have been made to employ Transformers for low-level vision tasks [10,14,15,46,63,68,71,75,78,79]. During the development of these models, the noise pattern used for training is typically kept consistent with that used for testing. Under this setting, the factor that determines a model's denoising performance is the fitting ability of the network, in other words, how well the network can overfit the training noise.…”
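
To make this conventional setting concrete, below is a minimal PyTorch-style sketch of a denoising pipeline in which the same noise model is used for both training and testing. All names here (`add_gaussian_noise`, `train_step`, `evaluate`, `TRAIN_SIGMA`) are hypothetical and chosen for illustration, not taken from any cited model; the point is only that when training and test noise match, evaluation mainly measures how well the network has fit that one noise distribution.

```python
import torch
import torch.nn as nn

# Fixed Gaussian noise level (sigma = 25/255), used identically at
# train and test time -- the "consistent noise pattern" in the text.
TRAIN_SIGMA = 25.0 / 255.0


def add_gaussian_noise(clean: torch.Tensor, sigma: float) -> torch.Tensor:
    """Synthesize a noisy observation y = x + n, with n ~ N(0, sigma^2)."""
    return clean + sigma * torch.randn_like(clean)


def train_step(model: nn.Module, clean_batch: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One supervised step: the network fits the mapping noisy -> clean."""
    noisy_batch = add_gaussian_noise(clean_batch, TRAIN_SIGMA)
    loss = nn.functional.l1_loss(model(noisy_batch), clean_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def evaluate(model: nn.Module, clean_test: torch.Tensor) -> float:
    """Testing reuses the SAME noise model as training, so the score
    chiefly reflects how well the network has fit (or overfit) this
    particular noise distribution."""
    noisy_test = add_gaussian_noise(clean_test, TRAIN_SIGMA)
    return nn.functional.mse_loss(model(noisy_test), clean_test).item()
```

Under this protocol, a network that overfits the training noise scores well, because the test-time corruption is statistically identical; its performance on an unseen noise type (a different sigma, or a non-Gaussian corruption) is left unexamined.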