“…Action recognition is fundamental to video-based tasks, and many approaches [20], [21], [22], [23], [24], [25], [26], [27], [28], [29] and datasets [30], [31], [18], [17], [19], [32], [33], [34] have been proposed for it. There is also a trend toward more fine-grained action understanding, moving from video classification [20], [21] to spatio-temporal action detection [32], [35], [36], [14] and human-part-level action recognition [15].…”