OperA: Attention-Regularized Transformers for Surgical Phase Recognition

Czempiel, Tobias; Paschali, Magdalini; Ostler, Daniel; Kim, Seong-Tae; Busam, Benjamin; Navab, Nassir

doi:10.1007/978-3-030-87202-1_58

Cited by 60 publications

(37 citation statements)

References 16 publications


“…Convolutional architectures such as ResNet-50 have been extensively used for phase segmentation of endoscopic videos. They serve as feature extraction backbones for many state-of-the-art recognition architectures [15,7,33]. Our baseline consists of a ResNet-50 pre-trained on ImageNet without temporal modeling.…”

Section: Methods (mentioning)

confidence: 99%


Know your sensORs -- A Modality Study For Surgical Action Classification

Bastian, Czempiel, Heiliger et al. 2022

Preprint

Self Cite


The surgical operating room (OR) presents many opportunities for automation and optimization. Videos from various sources in the OR are becoming increasingly available. The medical community seeks to leverage this wealth of data to develop automated methods to advance interventional care, lower costs, and improve overall patient outcomes. Existing datasets from OR room cameras are thus far limited in size or modalities acquired, leaving it unclear which sensor modalities are best suited for tasks such as recognizing surgical action from videos. This study demonstrates that surgical action recognition performance can vary depending on the image modalities used. We perform a methodical analysis on several commonly available sensor modalities, presenting two fusion approaches that improve classification performance. The analyses are carried out on a set of multi-view RGB-D video recordings of 18 laparoscopic procedures.


“…Current state-of-the-art methods typically combine convolutional backbones with LSTM or attention-based temporal accumulators. These methods are particularly well suited for longer videos, as frequently seen in the surgical domain, where acquisitions span many hours [9,6,7].…”

Section: Related Work (mentioning)

confidence: 99%

Know your sensORs -- A Modality Study For Surgical Action Classification

Bastian, Czempiel, Heiliger et al. 2022

Preprint

Self Cite


“…Czempiel et al [14] proposed to replace the frequently used LSTMs with a multi-stage temporal convolution network (TCN) [15] analyzing the long temporal relationships more efficiently. Additionally, attention-based transformer architectures [16] have been proposed [17] [18] to refine the temporal context even further and increase model interpretability. Fair Evaluation One of the biggest challenges in this domain is the limited benchmarking between existing methods.…”

Section: Related Work (mentioning)

confidence: 99%

Surgical Workflow Recognition: from Analysis of Challenges to Architectural Study

Czempiel, Sharghi, Paschali et al. 2022

Preprint

Self Cite


Algorithmic surgical workflow recognition is an ongoing research field and can be divided into laparoscopic (Internal) and operating room (External) analysis. So far many different works for the internal analysis have been proposed with the combination of a frame-level and an additional temporal model to address the temporal ambiguities between different workflow phases. For the External recognition task, Clip-level methods are in the focus of researchers targeting the local ambiguities present in the OR scene. In this work we evaluate combinations of different model architectures for the task of surgical workflow recognition to provide a fair comparison of the methods for both Internal and External analysis. We show that methods designed for the Internal analysis can be transferred to the external task with comparable performance gains for different architectures.


“…Here, a CNN is first trained on randomly sampled image batches, followed by a temporal model trained on the extracted visual features. Methods in this style have been proposed for phase recognition [9,10,16,59,62,63], duration prediction [2], tracking [39] or anticipation [61]. Most notably, TeCNO [9], a MS-TCN [13] trained on ResNet features, is the popular approach for 2-stage learning and Trans-SVNet [16], a 3-stage method which trains a Transformer model on TeCNO features, is the current state of the art in surgical phase recognition.…”

Section: Surgical Workflow Analysis (mentioning)

confidence: 99%

“…BN-related issues can be avoided by using multi-stage training procedures where backbones are trained on randomly sampled image batches. While the majority of research in surgical workflow analysis [2,10,9,16,39,59,61,62,63] has opted for this strategy, it has several disadvantages. Firstly, it increases the number of hyperparameters since learning rate, number of epochs etc.…”

Section: Disadvantages Of Multi-stage Learning (mentioning)

confidence: 99%

On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis

Rivoir, Funke, Speidel 2022

Preprint


Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequential modeling, and has led to the use of alternatives in these fields. In video learning, however, these problems are less studied, despite the ubiquitous use of BN in CNNs for visual feature extraction. We argue that BN's properties create major obstacles for training CNNs and temporal models end to end in video tasks. Yet, end-to-end learning seems preferable in specialized domains such as surgical workflow analysis, which lack well-pretrained feature extractors. While previous work in surgical workflow analysis has avoided BN-related issues through complex, multi-stage learning procedures, we show that even simple, end-to-end CNN-LSTMs can outperform the state of the art when CNNs without BN are used. Moreover, we analyze in detail when BN-related issues occur, including a "cheating" phenomenon in surgical anticipation tasks. We hope that a deeper understanding of BN's limitations and a reconsideration of end-to-end approaches can be beneficial for future research in surgical workflow analysis and general video learning.
