Although BERT-family models are usually smaller than their GPT-family counterparts, compressing the BERT family has received far more attention in the literature (e.g., DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), ALP-KD (Passban et al., 2021), MATE-KD (Rashid et al., 2021), Annealing-KD (Jafari et al., 2021), and BERTQuant (Zhang et al., 2020)). By contrast, to the best of our knowledge, only a handful of compressed models exist for the GPT family, among which DistilGPT2 is the most prominent.