“…For video representation learning, a large number of supervised learning methods have been proposed and received increasing attention, which relies on robust modeling and feature representation in videos. The methods include traditional methods [22,19,47,48,34,38,30,27] and deep learning methods [40,43,51,28,44,52,61,25,29,26]. To model and discover temporal knowledge in videos, twostream CNNs [40] judged the video image (spatial) and dense optical flow (temporal) separately, then directly fused the class scores of these two networks to obtain the classification result.…”