2016
DOI: 10.1007/978-3-319-49409-8_7

Temporal Convolutional Networks: A Unified Approach to Action Segmentation

Abstract: The dominant paradigm for video-based action segmentation is composed of two steps: first, for each frame, compute low-level features using Dense Trajectories or a Convolutional Neural Network that encode spatiotemporal information locally, and second, input these features into a classifier that captures high-level temporal relationships, such as a Recurrent Neural Network (RNN). While often effective, this decoupling requires specifying two separate models, each with its own complexities, and prevents captu…







Cited by 544 publications (369 citation statements)
References 17 publications
“…Nevertheless, sensory signals such as speech have long-range temporal dependencies for which recurrent networks may provide a better fit. Although we did not find a significant difference between the prediction accuracy of feedforward and recurrent neural networks in our data ( Supplementary Fig 8), the recent extensions of the feedforward architecture, such as dilated convolution (84) or temporal convolutional networks (85), can implement receptive fields that extend over long durations. Our proposed LLRF method would seamlessly generalize to these architectures, which can serve as an alternative to recurrent neural networks when modeling the long-term dependencies of the stimulus is crucial.…”
Section: Discussion (contrasting)
confidence: 66%
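The excerpt above notes that dilated convolutions can implement receptive fields extending over long durations. This can be made concrete with the standard receptive-field formula for stacked dilated 1D convolutions, RF = 1 + Σ (kᵢ − 1)·dᵢ, so doubling dilations give exponential growth with depth. A minimal sketch (function and parameter names are illustrative, not taken from the cited papers):

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in timesteps) of a stack of dilated 1D convolutions.

    Each layer i with kernel size k_i and dilation d_i widens the receptive
    field by (k_i - 1) * d_i timesteps.
    """
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf


# WaveNet-style doubling dilations 1, 2, 4, ..., 512 with kernel size 2:
dilations = [2 ** i for i in range(10)]
print(receptive_field([2] * 10, dilations))  # → 1024
```

Ten such layers already cover over a thousand timesteps, which is why these feedforward stacks can substitute for recurrence when long stimulus dependencies matter.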
“…Song et al. propose including layers for spatial and temporal attention (STA-LSTM) [28], which greatly improves recognition performance. For the majority of the experiments in this paper, we will use the Temporal Convolutional Network (TCN) with residual connections [18], as these are effective, simple to build, and faster to train than LSTM-based networks. Additionally, Kim and Reiter have shown excellent results using TCNs for 3D action recognition [17].…”
Section: Alignment Of Time-series Data (mentioning)
confidence: 99%
“…More recently, Convolutional Neural Networks (CNNs) became a popular tool for visual feature extraction. For example, Lea et al train a CNN (S-CNN ) for frame-wise gesture recognition [9] and use the latent video frame encodings as feature representations, which are further processed by a TCN for gesture recognition [10]. A TCN combines 1D convolutional filters with pooling and channel-wise normalization layers to hierarchically capture temporal relationships at low-, intermediate-, and high-level time scales.…”
mentioning
confidence: 99%
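As a rough illustration of the layer described above (1D convolution across time, temporal pooling, channel-wise normalization), here is a NumPy-only sketch. The shapes, the ReLU, and the max-based channel normalization are assumptions for illustration, not the exact layer used by Lea et al.:

```python
import numpy as np


def tcn_layer(x, w, pool=2, eps=1e-8):
    """One illustrative TCN layer on x of shape (channels_in, time).

    w has shape (channels_out, channels_in, kernel_width). The layer applies
    a 1D convolution over time (valid padding), a ReLU, temporal max pooling,
    and a channel-wise normalization that rescales each output channel by its
    maximum absolute activation.
    """
    c_out, c_in, k = w.shape
    # 1D convolution: each output channel sums correlations over input channels
    conv = np.stack([
        np.sum([np.convolve(x[ci], w[co, ci][::-1], mode="valid")
                for ci in range(c_in)], axis=0)
        for co in range(c_out)
    ])
    conv = np.maximum(conv, 0.0)  # ReLU nonlinearity
    # Temporal max pooling with non-overlapping windows of size `pool`
    t_p = conv.shape[1] // pool
    pooled = conv[:, : t_p * pool].reshape(c_out, t_p, pool).max(axis=2)
    # Channel-wise normalization: each channel scaled into [-1, 1]
    return pooled / (np.abs(pooled).max(axis=1, keepdims=True) + eps)
```

Stacking several such layers, each halving the temporal resolution, is what lets the network capture low-, intermediate-, and high-level time scales in one model.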
“…Features extracted from individual video frames cannot represent the dynamics in surgical video, i.e., changes between adjacent frames. To alleviate this problem, Lea et al [10] propose adding a number of difference images to the input fed to the S-CNN. For timestep t, difference images are calculated within a window of 2 seconds around frame v t .…”
mentioning
confidence: 99%
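The difference-image idea can be sketched as follows. The frame rate, sampling stride, and boundary handling are hypothetical choices for illustration; the exact offsets used in [10] are not specified here:

```python
import numpy as np


def difference_images(frames, t, fps=10, window_s=2.0, stride=5):
    """Illustrative stack of difference images around frame t.

    frames: array of shape (num_frames, H, W). Offsets span a window_s-second
    window centred on t, sampled every `stride` frames; indices are clamped
    at the video boundaries.
    """
    half = int(window_s * fps / 2)
    diffs = []
    for off in range(-half, half + 1, stride):
        if off == 0:
            continue  # frame t minus itself carries no motion information
        j = min(max(t + off, 0), len(frames) - 1)
        diffs.append(frames[j].astype(np.float32) - frames[t].astype(np.float32))
    return np.stack(diffs)
```

Concatenating these difference channels with the raw frame gives the S-CNN input a cheap proxy for motion between adjacent frames.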