“…Early DL-based methods use Convolutional Neural Networks (CNNs) to extract features and then apply recurrent neural networks for temporal modeling [46,58,80,95]. Since learning inter-person interactions is essential for GAR [97], much of the research explores how to capture actor relations [4,36,40,72,96]. Several works tackle this problem from a graph-based perspective [40,63,100,101], such as by applying Graph Convolutional Networks (GCNs) [49,96].…”