Training compact deep neural networks (DNNs) (Howard et al., 2017) efficiently has become an appealing topic because of the increasing demand for deploying DNNs on resource-limited devices such as mobile phones and drones (Moskalenko et al., 2018). Recently, a large number of approaches have been proposed for training lightweight DNNs with the help of a cumbersome, over-parameterized model, such as network pruning (Li et al., 2016; He et al., 2019; Wang et al., 2021), quantization (Han et al., 2015), factorization (Jaderberg et al., 2014), and knowledge distillation (KD) (Hinton et al., 2015; Phuong & Lampert, 2019; Jin et al., 2020; Yun et al., 2020; Passalis et al., 2020; Wang, 2021). Among these approaches, knowledge distillation is a popular scheme in which a compact student network is trained to mimic the softmax output (class probabilities) of a pre-trained deeper and wider teacher model (Hinton et al., 2015).
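Concretely, the vanilla KD objective of Hinton et al. (2015) combines a temperature-softened KL term against the teacher's class probabilities with the usual cross-entropy on the ground-truth labels. The following is a minimal PyTorch sketch of that loss, not the implementation of any particular method discussed here; the `temperature` and `alpha` values are illustrative defaults, not values taken from the text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.9):
    """Hinton-style KD loss: soft-label KL term plus hard-label CE term.

    `temperature` and `alpha` are illustrative hyperparameters chosen
    for this sketch, not prescribed by the surrounding text.
    """
    # Soften both output distributions with the temperature T.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)

    # KL divergence between the softened teacher and student outputs.
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, targets)

    # Weighted sum of the soft (teacher-mimicking) and hard terms.
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

Multiplying the KL term by T² compensates for the 1/T² scaling that temperature softening introduces into the gradients, keeping the soft and hard terms on comparable scales (Hinton et al., 2015).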