2019 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip.2019.8803088
Atrous Temporal Convolutional Network for Video Action Segmentation

Cited by 8 publications (6 citation statements)
References 13 publications
“…Built on a similar encoder-decoder backbone, more recent solutions rely instead on two-stream processing: one stream to capture multiscale contextual information and action dependencies in the input sequence, the other to extract low-level information for precise action boundary identification [47,48]. Basic temporal convolutions were replaced with a set of Deformable Temporal Residual Modules (TDRM) to capture the temporal variability of human actions and merge the two streams at multiple processing levels [47].…”
Section: A Convolutional Neural Network
confidence: 99%
“…Basic temporal convolutions were replaced with a set of Deformable Temporal Residual Modules (TDRM) to capture the temporal variability of human actions and merge the two streams at multiple processing levels [47]. Alternatively, atrous (also known as dilated) temporal convolutions and pyramid pooling were used to explicitly generate a multi-scale encoding of the input data, which was merged with the low-level local features just before the decoding phase [48]. Both architectures consistently improved both frame-wise and segmental evaluation scores over competing methods (see Section VII for a description of the most common evaluation metrics).…”
Section: A Convolutional Neural Network
confidence: 99%
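The atrous (dilated) temporal convolution mentioned in this excerpt can be illustrated with a minimal sketch. The function below is an assumed NumPy implementation for illustration only, not code from the cited paper; the kernel and dilation values are arbitrary.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1-D atrous (dilated) convolution along the time axis.

    x : (T,) input sequence; w : (k,) kernel; 'same' zero-padding keeps
    output length T. A dilation of d inserts d-1 gaps between kernel taps,
    widening the receptive field to 1 + (k-1)*d frames with no extra
    parameters -- the mechanism behind the multi-scale temporal encoding
    described above.
    """
    k = len(w)
    span = 1 + (k - 1) * dilation   # effective receptive field in frames
    pad = span // 2
    xp = np.pad(x, pad)             # zero-pad both ends
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            out[t] += w[i] * xp[t + i * dilation]
    return out

# Stacking layers with dilations 1, 2, 4, ... grows temporal context
# exponentially with depth.
x = np.arange(8, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2)
```

In a full encoder, several such layers with increasing dilation rates would be stacked so each output frame summarizes progressively longer action context.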
“…For example, the variations in facial expressions can only be exploited by modeling a sequence of consecutive facial images. Recent studies have demonstrated that temporal convolutional networks (TCNs) are highly competitive in many sequence modeling tasks, including action segmentation [18,29,56] and natural language processing [3,13]. Compared with recurrent neural network (RNN) architectures, TCNs have lower memory requirements and are easier to train [3].…”
Section: Appearance-based Prediction
confidence: 99%
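The context advantage of stacked dilated convolutions in a TCN can be quantified: each layer with kernel size k and dilation d adds (k-1)*d frames of temporal context. The helper below is a hypothetical illustration of that arithmetic, not part of any cited implementation.

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked dilated causal conv layers,
    as used in TCN sequence models: each layer contributes (k-1)*d frames
    on top of the single starting frame."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling the dilation per layer gives exponential context for linear
# depth: with k=3 and dilations 1, 2, 4, 8 the field spans
# 1 + 2*(1+2+4+8) = 31 frames.
rf = tcn_receptive_field(3, [1, 2, 4, 8])
```

This is one reason TCNs can match RNNs on long sequences while remaining fully parallelizable across time steps during training.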
“…Surgical video, rather than kinematics, also embeds gesture information, which can be extracted with spatio-temporal CNNs [27], 3D CNNs [28], multi-scale temporal convolutions [29,30], or hybrid encoder-decoder networks with temporal-convolutional filters for local motion modelling and bidirectional LSTMs for long-range dependency memorization [31].…”
Section: A Related Work
confidence: 99%