2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00554
Visual Semantic Role Labeling for Video Understanding

Cited by 30 publications (28 citation statements)
References 60 publications
“…CAD-120 (Koppula et al., 2013) is annotated with object affordances. V-COCO (Sadhu et al., 2021), as an extension of the widely used MS-COCO (Lin et al., 2014), added visual semantic role labels; the provided bounding box annotations also enabled spatial HOI detection. HICO later received an update in the form of HICO-DET (Chao et al., 2018), similarly incorporating bounding boxes.…”
Section: Action Triplet Datasets
Confidence: 99%
“…Wei et al. [74] and Cho et al. [16] introduce new models that depart from the typical two-stage classification pipeline to better model event-attribute relationships. Cho et al. [17] incorporate transformers into the original architecture, Sadhu et al. [60] apply the framework to video understanding, and Dehkordi et al. [20] instead use a CNN ensembling method. All of these approaches assume that the elements needed to identify the event are clearly depicted in the image, and none explores how the models perform when presented with ambiguous data.…”
Section: Situation Recognition and Verb Prediction
Confidence: 99%
“…State-of-the-art models. We compare with the state-of-the-art model in (Sadhu et al. 2021), which has two variants, built on I3D (Carreira and Zisserman 2017a) and SlowFast (Feichtenhofer et al. 2019). For all baselines, we consider the variant with Non-Local blocks (Wang et al. 2018), which VidSitu (Sadhu et al. 2021) showed to be more effective. We report their performance as given in the paper.…”
Section: Baselines
Confidence: 99%
“…Event Extraction. Extracting events from images/videos (Yatskar, Zettlemoyer, and Farhadi 2016; Pratt et al. 2020; Sadhu et al. 2021), texts (Ji and Grishman 2008; Wang et al. 2019; Liu et al. 2020; Lin et al. 2020b), or multimedia (Li et al. 2020; Chen et al. 2021; Wen et al. 2021; Li et al. 2022) has attracted extensive research efforts. One of the key challenges in event extraction is to model the structural nature (Wang et al. 2019; Li et al. 2020) of events and their associated argument roles.…”
Section: Related Work
Confidence: 99%