2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) 2019
DOI: 10.1109/iccvw.2019.00240
Resource Efficient 3D Convolutional Neural Networks

Cited by 164 publications (64 citation statements)
References 17 publications
“…The following points are noted:

Table 12. Comparisons with the state-of-the-art methods on the UCF101 and HMDB51 datasets over 3 splits

Algorithm                                    UCF101   HMDB51
C3D [34]                                     85.2%    --
Soft Attention + LSTM [13]                   --       41.3%
Two-Stream + LSTM [38]                       88.6%    --
TDD+FV [19]                                  90.3%    63.2%
RNN+FV [58]                                  88.0%    54.3%
LTC [33]                                     91.7%    64.8%
ST-ResNet [29]                               93.5%    66.4%
TSN (BN-Inception) [28]                      94.0%    68.5%
TSN (Journal version) [78]                   94.9%    71.0%
AdaScan [59]                                 89.4%    54.9%
ActionVLAD [39]                              92.7%    66.9%
TLE (BN-Inception) [40]                      95.6%    71.1%
Attention Cluster (ResNet-152) [47]          94.6%    69.2%
TesNet (ImageNet pre-trained) [84]           95.2%    71.5%
Two-in-one two stream [85]                   92.0%    --
R(2+1) [86]                                  85.8%    54.8%
3D-SqueezeNet [87]                           74.9%    --
Algorithm in [88] (pre-trained on Kinetics)  61.2%    33.4%
TSM [89]                                     95.9%    73.5%
T-STFT [90]                                  94.7%    71.5%
TEA [91]                                     96.9%

- The results of our methods are much better than the result of the attention-based method, Soft Attention + LSTM [13].
- Our method with BN-Inception as the backbone outperforms the conference version of TSN [28] with BN-Inception as the backbone by 0.8% and 1.1% on the UCF101 and HMDB51 datasets respectively, where both methods uniformly sample 25 frames from each video for testing.…”
Section: On the UCF101 and HMDB51 Datasets (mentioning, confidence: 99%)
“…Table 1 shows the results of all the 3D-CNN models when processing the new standard sequence images. All the models are from [38]. From the comparisons, we find that although the model parameters of ShuffleNet are very limited, ResNet50 achieves the best AUC value in the test set.…”
Section: Comparison With the Classic 3D CNN Network (mentioning, confidence: 99%)
“…In [17], the effect of dataset size on performance is investigated for several 3D-CNN architectures. Inflated versions of popular resource-efficient 2D-CNN architectures are analyzed for video classification tasks in [24]. In this work, we use the variants of 3D-CNNs for the AV-ASD task.…”
Section: Audio-Visual Feature Extraction (mentioning, confidence: 99%)
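The "inflation" of 2D-CNN architectures into 3D-CNNs mentioned in the excerpt above is, in its standard form, a weight-bootstrapping trick: each pretrained 2D kernel is replicated along a new temporal axis and rescaled so the 3D network initially reproduces the 2D network's response on a static clip. A minimal sketch of that kernel transform follows; the function name and the NumPy implementation are illustrative assumptions, not code from the cited works:

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D conv kernel of shape (out_ch, in_ch, kH, kW) into a
    3D kernel of shape (out_ch, in_ch, t, kH, kW).

    The kernel is replicated t times along the new temporal axis and
    divided by t, so convolving t identical frames with the 3D kernel
    yields exactly the original 2D convolution's output."""
    return np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2) / t
```

A quick sanity check of the construction: summing the inflated kernel over its temporal axis recovers the original 2D kernel exactly, which is what guarantees the identical-frames equivalence.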
“…Capturing motion patterns is crucial since movements of facial muscles and mouth are indicative of active speaking. We experimented with several high-performance and resource-efficient 3D-CNN architectures [24]. 3D-ResNeXt-101 performs best and becomes our final choice as video backbone.…”
Section: Audio-Visual Encoder Architecture (mentioning, confidence: 99%)
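The resource efficiency of the 3D backbones these excerpts build on (e.g. the 3D versions of MobileNet and ShuffleNet studied in the paper) comes largely from replacing full 3D convolutions with depthwise-separable ones. A rough parameter count illustrates the savings; the function names are illustrative, bias terms are ignored, and the arithmetic follows the standard MobileNet-style factorization rather than any specific model from the paper:

```python
def conv3d_params(c_in, c_out, k):
    # Standard 3D convolution: every output channel sees every input
    # channel through its own k x k x k kernel (bias ignored).
    return c_out * c_in * k ** 3

def dws_conv3d_params(c_in, c_out, k):
    # Depthwise-separable 3D convolution: one k x k x k kernel per
    # input channel, followed by a 1x1x1 pointwise conv to mix channels.
    return c_in * k ** 3 + c_in * c_out
```

For c_in = c_out = 64 and k = 3, the standard layer needs 64 * 64 * 27 = 110,592 weights versus 64 * 27 + 64 * 64 = 5,824 for the separable version, roughly a 19x reduction per layer.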