2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01606

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Abstract: Dynamic scene graph generation aims at generating a scene graph of the given video. Compared to the task of scene graph generation from images, it is more challenging because the dynamic relationships between objects and the temporal dependencies between frames allow for a richer semantic interpretation. In this paper, we propose Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason …
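
To make the two-module design described in the abstract concrete, below is a minimal PyTorch-style sketch. It assumes a single-layer spatial encoder over per-frame relationship tokens, a self-attention stack standing in for the temporal module, and a small sliding window over adjacent frames; the class name, layer sizes, window length, and predicate count are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class STTranSketch(nn.Module):
    # Hypothetical sketch of a spatial encoder followed by a temporal
    # self-attention stack over a sliding window of frames.
    def __init__(self, d_model=512, nhead=8, window=2, num_predicates=26):
        super().__init__()
        self.window = window
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        tmp_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.temporal_stack = nn.TransformerEncoder(tmp_layer, num_layers=3)
        self.rel_classifier = nn.Linear(d_model, num_predicates)

    def forward(self, frame_rel_tokens):
        # frame_rel_tokens: list of (num_pairs_t, d_model) tensors, one per
        # frame; each token represents one subject-object pair in that frame.
        spatial = [self.spatial_encoder(t.unsqueeze(0)).squeeze(0)
                   for t in frame_rel_tokens]
        logits = []
        for t, cur in enumerate(spatial):
            # temporal context: current frame plus preceding frames inside
            # the sliding window
            ctx = torch.cat(spatial[max(0, t - self.window + 1):t + 1], dim=0)
            out = self.temporal_stack(ctx.unsqueeze(0)).squeeze(0)
            logits.append(self.rel_classifier(out[-cur.size(0):]))
        return logits  # per-frame predicate logits for each object pair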

Cited by 93 publications (117 citation statements)
References 61 publications
“…Works using a hybrid ConvLSTM [149] are also found [62], [123]. Finally, in some instances, networks pre-trained to perform an auxiliary task (regarded as experts) are used to pre-process the input and provide specific information that can be leveraged by the Transformer [66], [131]. Some examples include object detection [80], action features [13], or scene, motion, OCR and facial features, among others [104].…”
Section: Embedding (mentioning)
confidence: 99%
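
As a hedged illustration of the "expert" pre-processing pattern described in the quote above, the snippet below uses a frozen ImageNet-pretrained backbone to embed detected regions into transformer tokens. The backbone choice, crop size, and projection dimension are assumptions for illustration and do not correspond to any of the cited models.

import torch
import torchvision

# Frozen pre-trained backbone acts as the "expert" feature extractor
# (requires torchvision >= 0.13 for the weights argument).
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()      # keep the 2048-d pooled features
backbone.eval()                        # the expert stays frozen

project = torch.nn.Linear(2048, 512)   # map expert features to d_model

def embed_regions(crops):
    # crops: (num_regions, 3, 224, 224) image crops, e.g. from a detector
    with torch.no_grad():
        feats = backbone(crops)        # (num_regions, 2048)
    return project(feats)              # (num_regions, 512) transformer tokens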
“…Local restriction approaches reduce computational complexity from O(T^2) to O(T·N), where N is the size of the neighborhood. One set of works [9], [99], [119], [131] defines the neighborhoods by sampling nearby tokens given a query, similar to the sliding-window approach in the NLP Longformer [153]. Importantly, in [99], the [CLS] token does perform all-to-all attention.…”
Section: Restricted Approaches (mentioning)
confidence: 99%
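
The restricted-attention idea quoted above can be illustrated with a small mask-building helper: each token attends only to a local neighborhood, while a [CLS] token keeps all-to-all attention. The window size and the position of the [CLS] token are assumptions, not details of the cited models.

import torch

def local_attention_mask(num_tokens, window=2, cls_index=0):
    # True means attention is allowed. Each token attends to tokens within
    # `window` positions, giving O(T*N) instead of O(T^2) interactions.
    idx = torch.arange(num_tokens)
    mask = (idx[None, :] - idx[:, None]).abs() <= window
    mask[cls_index, :] = True   # the [CLS] token attends to every token ...
    mask[:, cls_index] = True   # ... and every token attends to [CLS]
    return mask

# Usage: turn the boolean mask into an additive bias for attention scores.
mask = local_attention_mask(num_tokens=10, window=2)
attn_bias = torch.zeros(mask.shape).masked_fill(~mask, float("-inf"))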
“…The source code is made publicly available on GitHub. Many models [32], [33], [34], [35], [36], [37] are now available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41]. Two-stage methods following [2] currently dominate scene graph generation: several works [9], [32], [42], [43] use residual neural networks with global context to improve the quality of the generated scene graphs.…”
Section: Scene Graph Generation (mentioning)
confidence: 99%
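
The two-stage pattern mentioned in the quote (detect objects first, then classify a predicate for every subject-object pair) can be sketched as follows. The detector comes from torchvision and the relation head is a deliberately toy placeholder over box coordinates, so none of this corresponds to the cited models.

import itertools
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
relation_head = torch.nn.Linear(4 + 4, 51)  # toy head over box geometry only

def generate_scene_graph(image):
    with torch.no_grad():
        det = detector([image])[0]                   # stage 1: object detection
    boxes, labels = det["boxes"], det["labels"]
    triplets = []
    for i, j in itertools.permutations(range(len(boxes)), 2):
        pair_feat = torch.cat([boxes[i], boxes[j]])  # stage 2: pairwise features
        predicate = relation_head(pair_feat).argmax().item()
        triplets.append((labels[i].item(), predicate, labels[j].item()))
    return triplets  # list of (subject, predicate, object) class ids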
“…Its encoder-decoder configuration and attention mechanism are also used to solve various computer vision tasks in different ways, e.g. object detection [18], human-object interaction (HOI) detection [61], and dynamic scene graph generation [39].…”
Section: Transformer and Set Prediction (mentioning)
confidence: 99%
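
For the set-prediction use of the encoder-decoder configuration mentioned above, a DETR-style sketch is given below: learned object queries are decoded against encoded image features and each query predicts one element of the output set. The query count, dimensions, and heads are illustrative assumptions, not the cited implementations.

import torch
import torch.nn as nn

class SetPredictionSketch(nn.Module):
    # Minimal query-based set prediction with a transformer encoder-decoder.
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 "no object"
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, features):
        # features: (batch, num_tokens, d_model) flattened image features
        q = self.queries.weight.unsqueeze(0).expand(features.size(0), -1, -1)
        hs = self.transformer(src=features, tgt=q)
        return self.class_head(hs), self.box_head(hs).sigmoid()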