2016
DOI: 10.1007/978-3-319-49409-8_7

Temporal Convolutional Networks: A Unified Approach to Action Segmentation

Abstract: The dominant paradigm for video-based action segmentation is composed of two steps: first, for each frame, compute low-level features using Dense Trajectories or a Convolutional Neural Network that encode spatiotemporal information locally, and second, input these features into a classifier that captures high-level temporal relationships, such as a Recurrent Neural Network (RNN). While often effective, this decoupling requires specifying two separate models, each with its own complexities, and prevents captu…







Cited by 544 publications (369 citation statements)
References 17 publications
“…Nevertheless, sensory signals such as speech have long-range temporal dependencies for which recurrent networks may provide a better fit. Although we did not find a significant difference between the prediction accuracy of feedforward and recurrent neural networks in our data ( Supplementary Fig 8), the recent extensions of the feedforward architecture, such as dilated convolution (84) or temporal convolutional networks (85), can implement receptive fields that extend over long durations. Our proposed LLRF method would seamlessly generalize to these architectures, which can serve as an alternative to recurrent neural networks when modeling the long-term dependencies of the stimulus is crucial.…”
Section: Discussion (contrasting)
confidence: 66%
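The excerpt above notes that dilated convolutions can implement receptive fields extending over long durations. This can be made concrete with the standard receptive-field formula for stacked dilated 1D convolutions, RF = 1 + Σ (kᵢ − 1)·dᵢ, so doubling dilations give exponential growth with depth. A minimal sketch (function and parameter names are illustrative, not taken from the cited papers):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in timesteps) of a stack of dilated 1D convolutions.

    Each layer i with kernel size k_i and dilation d_i widens the receptive
    field by (k_i - 1) * d_i timesteps.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# WaveNet-style doubling dilations 1, 2, 4, ..., 512 with kernel size 2:
dilations = [2 ** i for i in range(10)]
print(receptive_field([2] * 10, dilations))  # → 1024
```

Ten such layers already cover over a thousand timesteps, which is why these feedforward stacks can substitute for recurrence when long stimulus dependencies matter.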
“…Song et al. propose including layers for spatial and temporal attention (STA-LSTM) [28], which greatly improves recognition performance. For the majority of the experiments in this paper, we will use the Temporal Convolutional Network (TCN) with residual connections [18], as these are effective, simple to build, and faster to train than LSTM-based networks. Additionally, Kim and Reiter have shown excellent results using TCNs for 3D action recognition [17].…”
Section: Alignment Of Time-series Data (mentioning)
confidence: 99%
“…More recently, Convolutional Neural Networks (CNNs) became a popular tool for visual feature extraction. For example, Lea et al train a CNN (S-CNN ) for frame-wise gesture recognition [9] and use the latent video frame encodings as feature representations, which are further processed by a TCN for gesture recognition [10]. A TCN combines 1D convolutional filters with pooling and channel-wise normalization layers to hierarchically capture temporal relationships at low-, intermediate-, and high-level time scales.…”
mentioning
confidence: 99%
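As a rough illustration of the layer described above (1D convolution across time, temporal pooling, channel-wise normalization), here is a NumPy-only sketch. The shapes, the ReLU, and the max-based channel normalization are assumptions for illustration, not the exact layer used by Lea et al.:

```python
import numpy as np


def tcn_layer(x, w, pool=2, eps=1e-8):
    """One illustrative TCN layer on x of shape (channels_in, time).

    w has shape (channels_out, channels_in, kernel_width). The layer applies
    a 1D convolution over time (valid padding), a ReLU, temporal max pooling,
    and a channel-wise normalization that rescales each output channel by its
    maximum absolute activation.
    """
    c_out, c_in, k = w.shape
    # 1D convolution: each output channel sums correlations over input channels
    conv = np.stack([
        np.sum([np.convolve(x[ci], w[co, ci][::-1], mode="valid")
                for ci in range(c_in)], axis=0)
        for co in range(c_out)
    ])
    conv = np.maximum(conv, 0.0)  # ReLU nonlinearity
    # Temporal max pooling with non-overlapping windows of size `pool`
    t_p = conv.shape[1] // pool
    pooled = conv[:, : t_p * pool].reshape(c_out, t_p, pool).max(axis=2)
    # Channel-wise normalization: each channel scaled into [-1, 1]
    return pooled / (np.abs(pooled).max(axis=1, keepdims=True) + eps)
```

Stacking several such layers, each halving the temporal resolution, is what lets the network capture low-, intermediate-, and high-level time scales in one model.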
“…Features extracted from individual video frames cannot represent the dynamics in surgical video, i.e., changes between adjacent frames. To alleviate this problem, Lea et al [10] propose adding a number of difference images to the input fed to the S-CNN. For timestep t, difference images are calculated within a window of 2 seconds around frame v t .…”
mentioning
confidence: 99%
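The difference-image idea can be sketched as follows. The frame rate, sampling stride, and boundary handling are hypothetical choices for illustration; the exact offsets used in [10] are not specified here:

```python
import numpy as np


def difference_images(frames, t, fps=10, window_s=2.0, stride=5):
    """Illustrative stack of difference images around frame t.

    frames: array of shape (num_frames, H, W). Offsets span a window_s-second
    window centred on t, sampled every `stride` frames; indices are clamped
    at the video boundaries.
    """
    half = int(window_s * fps / 2)
    diffs = []
    for off in range(-half, half + 1, stride):
        if off == 0:
            continue  # frame t minus itself carries no motion information
        j = min(max(t + off, 0), len(frames) - 1)
        diffs.append(frames[j].astype(np.float32) - frames[t].astype(np.float32))
    return np.stack(diffs)
```

Concatenating these difference channels with the raw frame gives the S-CNN input a cheap proxy for motion between adjacent frames.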