2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00554
Visual Semantic Role Labeling for Video Understanding

Cited by 30 publications (28 citation statements)
References 60 publications
“…CAD-120 (Koppula et al., 2013) is annotated with object affordances. V-COCO (Sadhu et al., 2021), as an extension of the widely used MS-COCO (Lin et al., 2014), added visual semantic role labels; the provided bounding box annotations also enabled spatial HOI detection. HICO later received an update in the form of HICO-DET (Chao et al., 2018), similarly incorporating bounding boxes.…”
Section: Action Triplet Datasets
Confidence: 99%
“…Wei et al. [74] and Cho et al. [16] introduce new models that depart from the typical two-stage classification pipeline to better model event-attribute relationships. Cho et al. [17] incorporate transformers into the original architecture, Sadhu et al. [60] apply the framework to video understanding, and Dehkordi et al. [20] instead use a CNN ensembling method. All of these approaches assume that the elements needed to identify the event are clearly depicted in the image, and none explores how the models perform when presented with ambiguous data.…”
Section: Situation Recognition and Verb Prediction
Confidence: 99%
“…State-of-the-art models. We compare with the state-of-the-art model in (Sadhu et al. 2021), which has two variants, built on I3D (Carreira and Zisserman 2017a) and SlowFast (Feichtenhofer et al. 2019). For all baselines, we consider the variant with Non-Local blocks (Wang et al. 2018), which VidSitu (Sadhu et al. 2021) showed to be more effective. We report their performance as given in the paper.…”
Section: Baselines
Confidence: 99%
“…Event Extraction. Extracting events from images/videos (Yatskar, Zettlemoyer, and Farhadi 2016; Pratt et al. 2020; Sadhu et al. 2021), texts (Ji and Grishman 2008; Wang et al. 2019; Liu et al. 2020; Lin et al. 2020b), or multimedia (Li et al. 2020; Chen et al. 2021; Wen et al. 2021; Li et al. 2022) has attracted extensive research efforts. One of the key challenges in event extraction is to model the structural nature (Wang et al. 2019; Li et al. 2020) of events and their associated argument roles.…”
Section: Related Work
Confidence: 99%