“…First, we notice that using the spatial encoder to implicitly extract intra-frame contexts yields only a small benefit. Since global scene context is useful for vision-related tasks (Wang et al., 2019; Zhang et al., 2021; Ji et al., 2022), we extend the spatial encoder to explicitly generate a global feature vector for each frame. Inspired by the Vision Transformer (ViT) (Dosovitskiy et al., 2021), we prepend a learnable class token to the spatial encoder input, which captures the global relationship among all human-object pairs at a particular moment.…”
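The class-token mechanism described above can be sketched as follows. This is a minimal NumPy illustration with hypothetical dimensions and a toy single-head self-attention, not the paper's actual spatial encoder: a learnable token is prepended to the per-pair tokens of one frame, attends over all of them, and its output serves as the frame-level global feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # token dimension (hypothetical)
num_pairs = 5    # human-object pair tokens in one frame (hypothetical)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention over all tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

# learnable class token (randomly initialized here; trained in practice)
cls_token = rng.normal(size=(1, d))
pair_tokens = rng.normal(size=(num_pairs, d))  # per-pair representations

# prepend the class token, then attend over all tokens jointly
x = np.concatenate([cls_token, pair_tokens], axis=0)  # (num_pairs + 1, d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)

# the class-token output summarizes the whole frame
global_feature = out[0]
print(x.shape, global_feature.shape)  # (6, 8) (8,)
```

Because the class token participates in attention alongside every human-object pair token, its output aggregates information from all pairs in the frame, which is what makes it usable as an explicit global feature vector.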