“…Learning discriminative image representations in an unsupervised/self-supervised manner has attracted increasing interest (Agrawal, Carreira, and Malik 2015; Doersch, Gupta, and Efros 2015; Xie et al. 2021), because it removes the need for costly manually labeled data and achieves promising performance on many downstream tasks (Larsson et al. 2019; Hung et al. 2019; Doersch and Zisserman 2017). These methods generally design pretext tasks and learn the representation from labels generated by the tasks themselves, such as rotation prediction (Komodakis and Gidaris 2018), jigsaw puzzles (Noroozi and Favaro 2016; Kim et al. 2018), in-painting (Pathak et al. 2016), colorization (Zhang, Isola, and Efros 2016; Larsson, Maire, and Shakhnarovich 2017), and clustering (Noroozi et al. 2018; Caron et al. 2018).…”
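
To make the idea of labels "generated by the task" concrete, the following is a minimal sketch of label generation for a rotation-prediction pretext task. It is an illustration only, not the implementation of any cited work: the function name `make_rotation_batch` is hypothetical, and the sketch assumes square images stored as a NumPy array of shape `(N, H, W, C)`.

```python
import numpy as np


def make_rotation_batch(images):
    """Build a self-labeled batch for rotation prediction.

    images: array of shape (N, H, W, C) with H == W (assumed square).
    Returns (rotated, labels), where rotated has shape (4N, H, W, C)
    and labels[i] in {0, 1, 2, 3} is the number of 90-degree turns applied.
    """
    rotated, labels = [], []
    for k in range(4):  # 0, 90, 180, 270 degrees
        # Rotate every image in the batch in its spatial plane (axes 1 and 2).
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        # The label is the rotation index itself; no human annotation is used.
        labels.append(np.full(len(images), k, dtype=np.int64))
    return np.concatenate(rotated), np.concatenate(labels)


# Usage: two random 8x8 RGB "images" yield eight training examples.
batch = np.random.rand(2, 8, 8, 3)
x, y = make_rotation_batch(batch)
```

A classifier trained to predict `y` from `x` receives its supervision entirely from the transformation applied, which is the sense in which such methods avoid manual labeling.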