2021
DOI: 10.1016/j.neunet.2021.05.034

QTTNet: Quantized tensor train neural networks for 3D object and video recognition

Cited by 21 publications (5 citation statements)
References 16 publications
“…2D Spatial Convolution Followed By 1D Temporal Convolution (R(2 + 1)D) [14] uses a 2D spatial convolution followed by a 1D temporal convolution to approximate a 3D convolution, which achieves better results with the same number of parameters. Deformable 3D Convolution (D3D) [15] adds a learnable offset to the sampling positions of the 3D convolution to make it deformable, and Quantized Tensor Train Neural Networks (QTTNet) [16] employs the Tensor Train (TT) format [17] to compress 3D convolutions, greatly reducing the time cost and number of parameters required by 3D convolutional networks. In addition, Cao et al. [18] employ 3D depth-wise separable convolution to recognize video gestures, and Zhu et al. [19] employ a pyramid 3D convolutional network architecture for gesture recognition.…”
Section: Related Work (A. 3D Convolution)
mentioning confidence: 99%
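For intuition, here is a minimal sketch of the R(2 + 1)D factorization described in the statement above, written in PyTorch (assumed here; the block structure and the intermediate channel count `mid_channels` are illustrative choices, not values taken from [14]):

import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """A k x k x k 3D convolution replaced by a (1, k, k) spatial convolution
    followed by a (k, 1, 1) temporal convolution, as in the R(2+1)D idea."""
    def __init__(self, in_channels, out_channels, k=3, mid_channels=None):
        super().__init__()
        if mid_channels is None:
            mid_channels = out_channels  # hypothetical default, chosen for simplicity
        pad = k // 2
        # 2D spatial convolution applied frame-wise
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, k, k), padding=(0, pad, pad))
        # 1D temporal convolution across frames
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(k, 1, 1), padding=(pad, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))

# Usage: a clip of 8 RGB frames at 32x32 resolution
clip = torch.randn(1, 3, 8, 32, 32)
out = R2Plus1DBlock(3, 16)(clip)
print(out.shape)  # torch.Size([1, 16, 8, 32, 32])

The nonlinearity between the two convolutions is what gives the factorized block more expressive power than a single 3D convolution with the same parameter budget.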
“…The pioneering work by Han et al. (2015) proposed a three-step pipeline to compress a pre-trained model by pruning the uninformative connections, quantizing the remaining weights, and encoding the discretized parameters. These ideas are complementary to low-rank factorization: Goyal et al. (2019) demonstrated a joint use of pruning and low-rank factorization, and Lee et al. (2021) a combination of quantization and low-rank factorization.…”
Section: Related Work
mentioning confidence: 99%
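As a rough illustration of how quantization and low-rank factorization compose, the NumPy sketch below factorizes a weight matrix by truncated SVD and then uniformly quantizes the factors. The rank and bit-width are hypothetical, and this is a generic sketch rather than the pipeline of any cited paper:

import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A: m x rank, B: rank x n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

def quantize_uniform(x, bits=8):
    """Symmetric uniform quantization to `bits` bits; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
A, B = low_rank_factorize(W, rank=32)                # step 1: low-rank factorization
A_q, B_q = quantize_uniform(A), quantize_uniform(B)  # step 2: quantize the factors
err = np.linalg.norm(W - A_q @ B_q) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")

The two techniques save in different places, which is why they compose: factorization reduces the number of stored values, while quantization reduces the bits per stored value.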
“…A dynamic policy-learning network is trained simultaneously with the standard video recognition network to produce the target precision for each frame, with a computational-cost loss that balances competitive performance and resource efficiency. Furthermore, Lee et al. [85] combine the tensor-train format and model quantization, first reducing the number of trainable parameters and then applying low-bit quantization to all parameters, to fully lower the memory and time costs produced by 3D CNNs. With lower-bit parameters, the storage requirements, inference time, and energy consumption of video processing models can be significantly reduced, making video processing more computationally efficient.…”
Section: Model-Based Video Processing Acceleration
mentioning confidence: 99%
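A back-of-the-envelope sketch of where the two savings come from: the tensor-train format stores a chain of small cores instead of a full tensor, and low-bit quantization then shrinks each stored value. The mode sizes, TT ranks, and bit-widths below are illustrative assumptions, not QTTNet's actual settings:

def tt_param_count(dims, ranks):
    """Parameters in a TT representation of a tensor with the given mode sizes.

    Core k has shape (ranks[k], dims[k], ranks[k+1]); boundary ranks are 1.
    """
    r = [1] + list(ranks) + [1]
    return sum(r[k] * dims[k] * r[k + 1] for k in range(len(dims)))

dims = [4, 8, 8, 4, 4, 8, 8, 4]  # e.g. a 1024 x 1024 weight reshaped into an 8-way tensor
ranks = [8] * (len(dims) - 1)    # hypothetical uniform TT rank

dense = 1
for d in dims:
    dense *= d                   # 1024 * 1024 = 1,048,576 full parameters

tt = tt_param_count(dims, ranks)
print(f"dense params: {dense:,}, TT params: {tt:,} ({dense / tt:.0f}x fewer)")
# Quantizing the TT cores from 32-bit floats to 8 bits gives a further 4x
# storage reduction on top of the factorization.
print(f"dense fp32: {dense * 4:,} bytes, TT int8: {tt:,} bytes")

The factorization savings grow with tensor order and shrink with TT rank, which is why the choice of reshaping and ranks drives the accuracy/compression trade-off in TT-compressed networks.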