“…14 The principle behind the Transformer and its variants is the self-attention mechanism within a general encoder-decoder framework. Munyer et al.15 present self-supervised foreign object detection and localization using a vision Transformer (ViT) backbone.16 Although Transformer models have been investigated for foreign object detection and localization owing to their capability to model intricate relationships among image patches or pixels across an image, their performance relies heavily on large amounts of training data.…”
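The self-attention principle referenced above can be illustrated with a minimal scaled dot-product attention sketch in NumPy. The patch count, embedding dimension, and projection matrices below are illustrative assumptions, not details from the cited works:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set of patch embeddings."""
    # Project input patches/tokens into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Pairwise similarity scores: every patch attends to every other patch.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is a weighted mixture of values, capturing relationships
    # among patches across the whole image.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 image patches, 8-dim embeddings (illustrative)
Wq = rng.standard_normal((8, 8))
Wk = rng.standard_normal((8, 8))
Wv = rng.standard_normal((8, 8))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Because every patch attends to every other patch, the mechanism models long-range dependencies across the image, which is also why these models typically need large training sets to learn useful attention patterns.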