“…18,19 Among the different neural-based architectures, recurrent neural networks (RNNs), which are specially designed to handle sequential data with variable length, have achieved promising performances in 3D action recognition. 20,21 For example, Liu et al 13 proposed a long short-term memory (LSTM) network incorporating a tree structure to describe the relation of human parts, which successfully utilizes the spatiotemporal characteristics of human actions for the recognition task and achieves desirable accuracy on a large data set, ie, NTU RGB+D. 22 Following on the thought of the modeling relationship of two concurrent domains, ie, spatial and temporal, Hu et al 23 proposed a deep bilinear framework to further describe such relationship, where their proposed modality pooling layer and temporal pooling layer could pool the input action sequence along the modality and temporal directions separately.…”