“…Similar to [2], Feichtenhofer et al. [15] showed that two-stream fusion at an intermediate layer, using RGB images and a stack of ten optical flow frames, can improve performance with fewer parameters. Extensions of two-stream networks include the original Two-stream ConvNet [2], Two-stream ConvPooling [44], TDD+FV [13], Two-stream Transformations [51], Two-stream ResNet [15], TSN (3 modalities) [14], KVMF [52], ST-ResNet [15], AdaScan [53], Three-stream sDTD [54], ST-VLMPF [17], SPN (BN-Inception) [55], and ActionVLAD [56]. Despite the good performance of multi-stream frameworks, it remains unclear whether deep learning based models can capture subtle motion cues and long-term motion dynamics well enough without multi-stream fusion.…”
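To make the intermediate-fusion idea concrete, the following is a minimal NumPy sketch, not the architecture of [15]: a spatial stream receives one RGB frame, a temporal stream receives ten stacked optical-flow frames (10 × 2 channels for horizontal and vertical flow), and their intermediate feature maps are fused by channel concatenation so a single trunk can continue from there. All shapes, the `conv_stub` projection, and the random weights are illustrative assumptions.

```python
import numpy as np

def conv_stub(x, out_channels, seed):
    """Stand-in for a learned conv layer: a 1x1 projection with random weights.

    x: array of shape (channels, height, width); illustrative only.
    """
    rng = np.random.default_rng(seed)
    c_in = x.shape[0]
    w = rng.standard_normal((out_channels, c_in)) * 0.01
    # 1x1 convolution == per-pixel linear map over channels
    return np.einsum("oc,chw->ohw", w, x)

rgb = np.random.rand(3, 56, 56)    # one RGB frame (hypothetical size)
flow = np.random.rand(20, 56, 56)  # 10 flow frames x (u, v) components

spatial_feat = conv_stub(rgb, 64, seed=0)    # intermediate spatial features
temporal_feat = conv_stub(flow, 64, seed=1)  # intermediate temporal features

# Intermediate fusion: concatenate feature maps along the channel axis.
# After this point only one (fused) trunk is needed, which is why
# mid-level fusion can use fewer parameters than two full streams
# merged only at the final scores.
fused = np.concatenate([spatial_feat, temporal_feat], axis=0)
print(fused.shape)  # (128, 56, 56)
```

Late fusion, by contrast, would run both streams to completion and average their class scores, duplicating all higher-layer parameters.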