“…1. Crosses denote models based on the Transformer architecture (Audio-MAE [19], HTS-AT [18], PaSST-S [17], PaSST-S-L [17], AST [16], KD-AST [10]), and circles denote models based on CNNs (PSLA [2], ERANN-1-6 [3], Wavegramlogmel-CNN [1], CNN14 [1], KD-CNN [10], MobileNets [7] (ours)).…”