2021
DOI: 10.1016/j.neunet.2021.05.034

QTTNet: Quantized tensor train neural networks for 3D object and video recognition

Cited by 21 publications (5 citation statements)
References 16 publications
“…2D Spatial Convolution Followed By 1D Temporal Convolution (R(2 + 1)D) [14] uses a 2D spatial convolution followed by a 1D temporal convolution to approximate a 3D convolution, which achieves better results with the same number of parameters. Deformable 3D Convolution (D3D) [15] adds a learnable offset to the sampling positions of the 3D convolution to make it deformable, and Quantized Tensor Train Neural Networks (QTTNet) [16] employs the Tensor Train (TT) format [17] to compress 3D convolutions, greatly reducing the time cost and number of parameters required by 3D convolutional networks. In addition, Cao et al. [18] employ 3D depth-wise separable convolution to recognize video gestures, and Zhu et al. [19] employ a pyramid 3D convolutional network architecture for gesture recognition.…”
Section: Related Work (A. 3D Convolution)
mentioning confidence: 99%
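For intuition, here is a minimal sketch of the R(2 + 1)D factorization described in the statement above, written in PyTorch (assumed here; the block structure and the intermediate channel count `mid_channels` are illustrative choices, not values taken from [14]):

import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """A k x k x k 3D convolution replaced by a (1, k, k) spatial convolution
    followed by a (k, 1, 1) temporal convolution, as in the R(2+1)D idea."""
    def __init__(self, in_channels, out_channels, k=3, mid_channels=None):
        super().__init__()
        if mid_channels is None:
            mid_channels = out_channels  # hypothetical default, chosen for simplicity
        pad = k // 2
        # 2D spatial convolution applied frame-wise
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, k, k), padding=(0, pad, pad))
        # 1D temporal convolution across frames
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(k, 1, 1), padding=(pad, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))

# Usage: a clip of 8 RGB frames at 32x32 resolution
clip = torch.randn(1, 3, 8, 32, 32)
out = R2Plus1DBlock(3, 16)(clip)
print(out.shape)  # torch.Size([1, 16, 8, 32, 32])

The nonlinearity between the two convolutions is what gives the factorized block more expressive power than a single 3D convolution with the same parameter budget.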
“…The pioneering work by Han et al. (2015) proposed a three-step pipeline to compress a pre-trained model by pruning the uninformative connections, quantizing the remaining weights, and encoding the discretized parameters. These ideas are complementary to low-rank factorization: Goyal et al. (2019) demonstrated a joint use of pruning and low-rank factorization, and Lee et al. (2021) a combination of quantization and low-rank factorization.…”
Section: Related Work
mentioning confidence: 99%
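As a rough illustration of how quantization and low-rank factorization compose, the NumPy sketch below factorizes a weight matrix by truncated SVD and then uniformly quantizes the factors. The rank and bit-width are hypothetical, and this is a generic sketch rather than the pipeline of any cited paper:

import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A: m x rank, B: rank x n."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

def quantize_uniform(x, bits=8):
    """Symmetric uniform quantization to `bits` bits; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))
A, B = low_rank_factorize(W, rank=32)                # step 1: low-rank factorization
A_q, B_q = quantize_uniform(A), quantize_uniform(B)  # step 2: quantize the factors
err = np.linalg.norm(W - A_q @ B_q) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")

The two techniques save in different places, which is why they compose: factorization reduces the number of stored values, while quantization reduces the bits per stored value.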
“…A dynamic policy-learning network is trained simultaneously with the standard video recognition network to produce the target precision for each frame, with a computational-cost loss that balances competitive performance and resource efficiency. Furthermore, Lee et al. [85] combine the tensor-train format and model quantization, first reducing the number of trainable parameters and then applying low-bit quantization to all parameters, to fully lower the memory and time costs produced by 3D CNNs. With lower-bit parameters, the storage requirements, inference time, and energy consumption of video processing models can be significantly reduced, making video processing more computationally efficient.…”
Section: Model-Based Video Processing Acceleration
mentioning confidence: 99%
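A back-of-the-envelope sketch of where the two savings come from: the tensor-train format stores a chain of small cores instead of a full tensor, and low-bit quantization then shrinks each stored value. The mode sizes, TT ranks, and bit-widths below are illustrative assumptions, not QTTNet's actual settings:

def tt_param_count(dims, ranks):
    """Parameters in a TT representation of a tensor with the given mode sizes.

    Core k has shape (ranks[k], dims[k], ranks[k+1]); boundary ranks are 1.
    """
    r = [1] + list(ranks) + [1]
    return sum(r[k] * dims[k] * r[k + 1] for k in range(len(dims)))

dims = [4, 8, 8, 4, 4, 8, 8, 4]  # e.g. a 1024 x 1024 weight reshaped into an 8-way tensor
ranks = [8] * (len(dims) - 1)    # hypothetical uniform TT rank

dense = 1
for d in dims:
    dense *= d                   # 1024 * 1024 = 1,048,576 full parameters

tt = tt_param_count(dims, ranks)
print(f"dense params: {dense:,}, TT params: {tt:,} ({dense / tt:.0f}x fewer)")
# Quantizing the TT cores from 32-bit floats to 8 bits gives a further 4x
# storage reduction on top of the factorization.
print(f"dense fp32: {dense * 4:,} bytes, TT int8: {tt:,} bytes")

The factorization savings grow with tensor order and shrink with TT rank, which is why the choice of reshaping and ranks drives the accuracy/compression trade-off in TT-compressed networks.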