“…Recent model compression works fall into three general classes. Pruning forces some weights or activations to zero [20-22, 26, 32, 34, 40, 47] and is often combined with "zero-aware" memory encodings that store only the remaining nonzero values. Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model [1, 17, 24, 27, 29, 37, 43]. Lastly, quantization reduces parameters and/or activations to shorter bit-widths [6, 19, 39, 50, 51, 54].…”
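To make the three classes concrete, below is a minimal NumPy sketch of the core idea behind each: magnitude-based pruning, temperature-softened distillation targets, and uniform quantization. These are generic illustrations, not the specific methods of the cited works; the function names and the sparsity, temperature, and bit-width settings are arbitrary choices for the example.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Pruning: force the smallest-magnitude fraction `sparsity` of weights to zero."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def soft_targets(teacher_logits: np.ndarray, T: float = 2.0) -> np.ndarray:
    """Distillation: temperature-softened teacher outputs for a student to match."""
    z = teacher_logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantization: snap weights to 2**bits uniformly spaced levels, then dequantize."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    codes = np.round((w - w_min) / scale)  # integer codes in [0, levels]
    return codes * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
print(magnitude_prune(w, 0.5))   # half the entries become zero; a "zero-aware"
                                 # encoding would then store only the nonzeros
print(uniform_quantize(w, 4))    # every entry snapped to one of 16 values
print(soft_targets(rng.normal(size=(2, 10))))  # each row is a valid distribution
```

In practice these ideas compose: a pruned, quantized student model can be trained against soft targets, which is why the three classes are often combined rather than used in isolation.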