In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapid increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
k e V and the (1 0 ") 3055-keV levels, betw een the ( ll~) 3 2 4 0 -k e V and the 11<+) 3 1 19-keV levels, and betw een the l l (+) 3 1 19-keV and the (10" ) 3055-keV levels are not given. T he energies are 285, 120, and 65 keV, respectively, as show n in the corrected level schem e below. A rrow sym bols are added to the lines denoting the 258-keV y ray betw een the (1 4 " )46 9 8 -k eV and the (1 4 " )4440-keV levels, the 193-keV y ray betw een the (1 4~)4 440-keV and the (13" )4247-keV levels, and the 345-keV y ray betw een the (13" ) 4247-keV and the (13" ) 3902-keV levels.T he corrections do not affect the results and conclusions o f the original paper.
Reading comprehension (RC) has been studied in a variety of datasets with the boosted performance brought by deep neural networks. However, the generalization capability of these models across different domains remains unclear. To alleviate the problem, we investigate unsupervised domain adaptation on RC, wherein a model is trained on the labeled source domain and to be applied to the target domain with only unlabeled samples. We first show that even with the powerful BERT contextual representation, a model can not generalize well from one domain to another. To solve this, we provide a novel conditional adversarial self-training method (CASe). Specifically, our approach leverages a BERT model fine-tuned on the source dataset along with the confidence filtering to generate reliable pseudo-labeled samples in the target domain for self-training. On the other hand, it further reduces domain distribution discrepancy through conditional adversarial learning across domains. Extensive experiments show our approach achieves comparable performance to supervised models on multiple large-scale benchmark datasets.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.