2022
DOI: 10.1016/j.media.2022.102433

Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos

Cited by 69 publications (64 citation statements)
References 32 publications
“…In the surgical action triplet recognition problem, the main task is to recognise the triplet $IVT$, which is the composition of three components. In current state-of-the-art (SOTA) deep models [14], [6], there is a communal structure divided into three parts: i) the feature extraction backbone; ii) the individual component encoder; iii) the triplet aggregation decoder that associates the components and outputs the logits of the $IVT$ triplet. More precisely, the individual component encoder first concentrates on the instrument component to output Class Activation Maps ($\mathrm{CAMs} \in \mathbb{R}^{H \times W \times C_d}$) and the logits $Y_I$ of the instrument classes; the CAMs are then associated with the verb and target components separately for their logits ($Y_V$ and $Y_T$) to address the instrument-centric nature of the triplet.…”
Section: A Surgical Action Triplet Recognition
Citation type: mentioning
confidence: 99%
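
The three-part structure described in the statement above (backbone, component encoder, triplet decoder) can be pictured with a small shape-level sketch. This is only an illustration under stated assumptions, not the implementation of Tripnet, Attention Tripnet, or Rendezvous: the sizes, the random weights, the spatial max pooling used to turn CAMs into instrument logits, the concatenate-and-average guidance step, and the additive decoder are all hypothetical stand-ins chosen to keep the example in plain Python.

```python
import random

random.seed(0)

H, W, D = 4, 4, 8          # feature-map height, width, channels (illustrative)
N_I, N_V, N_T = 3, 2, 2    # instrument, verb, target class counts (illustrative)


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weights):
    """Multiply a vector by a weight matrix of shape len(vec) x n_out."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]


def backbone(frame):
    """Stage i: placeholder feature extractor returning an H x W x D map."""
    return [[[random.uniform(0, 1) for _ in range(D)]
             for _ in range(W)] for _ in range(H)]


def component_encoder(features, w_cam, w_verb, w_target):
    """Stage ii: instrument CAMs first, then CAM-guided verb and target logits."""
    # Instrument CAMs: one activation per instrument class at every location.
    cams = [[linear(px, w_cam) for px in row] for row in features]
    # Instrument logits: spatial maximum of each class channel (assumed pooling).
    y_i = [max(cams[h][w][c] for h in range(H) for w in range(W))
           for c in range(N_I)]
    # Guide the features with the CAMs: concatenate per pixel, average over space.
    pooled = [0.0] * (D + N_I)
    for h in range(H):
        for w in range(W):
            for k, value in enumerate(features[h][w] + cams[h][w]):
                pooled[k] += value / (H * W)
    y_v = linear(pooled, w_verb)
    y_t = linear(pooled, w_target)
    return cams, y_i, y_v, y_t


def triplet_decoder(y_i, y_v, y_t):
    """Stage iii: one logit per (instrument, verb, target) combination."""
    # Additive stand-in for a learned association between the components.
    return [[[a + b + c for c in y_t] for b in y_v] for a in y_i]


features = backbone(frame=None)
cams, y_i, y_v, y_t = component_encoder(
    features,
    rand_matrix(D, N_I),
    rand_matrix(D + N_I, N_V),
    rand_matrix(D + N_I, N_T),
)
y_ivt = triplet_decoder(y_i, y_v, y_t)

print("CAM shape:", (len(cams), len(cams[0]), len(cams[0][0])))
print("component logits:", len(y_i), len(y_v), len(y_t))
print("triplet logits:", len(y_ivt) * len(y_ivt[0]) * len(y_ivt[0][0]))
```

The point of the sketch is the data flow: the instrument CAMs are computed once and then reused when producing the verb and target logits, which is one simple way to reflect the instrument-centric ordering the statement describes.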
“…We do that by analysing the feature based explanations via robustness. To do this, we consider the current three SOTA techniques for our study: Tripnet [14], Attention Tripnet and Rendezvous [6]. Moreover, we extensively investigate the repercussion of deep features using four widely used backbones ResNet-18, ResNet-50 [26], DenseNet-121 [27] and Swin Transformer [28].…”
Section: Hypothesis 21: Deep Features Are Key For Robustness
Citation type: mentioning
confidence: 99%