Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475438

TSA-Net: Tube Self-Attention Network for Action Quality Assessment

Abstract: In recent years, assessing action quality from videos has attracted growing attention in the computer vision and human-computer interaction communities. Most existing approaches tackle this problem by directly migrating a model from action recognition tasks, which ignores intrinsic differences within the feature map, such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a sin…
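
The mechanism named in the abstract can be sketched roughly as follows. This is a minimal sketch based on one reading of the abstract, not the authors' released implementation: the function name, tensor shapes, and the omission of the learned query/key/value projections of a full non-local block are all simplifying assumptions. The key idea is that attention is computed only among feature positions that fall inside tracker-derived tube masks, which is what makes the feature interaction sparse.

```python
# Minimal sketch (assumed, not the authors' code) of tube self-attention:
# non-local-style attention restricted to feature positions inside
# tracker-derived tube masks, leaving background positions untouched.
import torch
import torch.nn.functional as F

def tube_self_attention(feat, tube_mask):
    """feat: (T, C, H, W) backbone features; tube_mask: (T, H, W) bool mask
    of foreground positions produced by an external single-object tracker."""
    T, C, H, W = feat.shape
    x = feat.permute(0, 2, 3, 1).reshape(-1, C)            # (T*H*W, C)
    idx = tube_mask.reshape(-1).nonzero(as_tuple=True)[0]  # tube positions
    tube = x[idx]                                          # (N, C), N << T*H*W
    # Self-attention among tube positions only (learned projections omitted).
    attn = F.softmax(tube @ tube.t() / C ** 0.5, dim=-1)   # (N, N)
    out = x.clone()
    out[idx] = out[idx] + attn @ tube                      # residual update
    return out.reshape(T, H, W, C).permute(0, 3, 1, 2)
```

The sparsity comes from attending over N tube positions instead of all T·H·W positions, which matches the efficiency argument the abstract hints at.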


Cited by 37 publications (18 citation statements) · References 41 publications

Citation statements (ordered by relevance):
“…Under the 'w/o DD' setting, our method achieves 0.9451 (Sp. Corr) and 0.3222 (R-ℓ2), outperforming CoRe and the recent TSA-Net [27]. It is worth noting that TSA-Net utilizes an external VOT tracker [35] to extract human locations and then enhance backbone features, which is orthogonal to the main issue of temporal parsing addressed in our work.…”
Section: Comparison to State-of-the-art (mentioning)
confidence: 96%
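
For reference, the two numbers quoted throughout these statements are the standard AQA metrics: Spearman rank correlation (Sp. Corr) and relative ℓ2 distance (R-ℓ2, reported ×100). Below is a minimal sketch of the usual definitions, assuming the range-normalized squared-error formulation of R-ℓ2 popularized by CoRe [32]; the function names are illustrative, not from any cited codebase.

```python
# Sketch of the two standard AQA metrics (assumed definitions; R-l2
# follows the range-normalized squared error used by CoRe [32]).
import numpy as np
from scipy import stats

def spearman_corr(pred, gt):
    """Spearman rank correlation between predicted and ground-truth scores."""
    rho, _ = stats.spearmanr(pred, gt)
    return rho

def relative_l2(pred, gt, score_min, score_max):
    """R-l2 (reported x100): squared error normalized by the score range."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return 100.0 * np.mean(((pred - gt) / (score_max - score_min)) ** 2)
```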
“…On the MTL-AQA dataset, we evaluated our method under two different settings, following prior work [32]. Specifically, the MTL-AQA dataset contains the label of difficulty degree (DD), and each video's quality score is calculated by multiplying the raw score by its difficulty.

Method (w/o DD)       Sp. Corr   R-ℓ2 (×100)
Pose+DCT [20]         0.2682     -
C3D-SVR [19]          0.7716     -
C3D-LSTM [19]         0.8489     -
MSCADC-STL [18]       0.8472     -
C3D-AVG-STL [18]      0.8960     -
MSCADC-MTL [18]       0.8612     -
C3D-AVG-MTL [18]      0.9044     -
USDL [22]             0.9066     0.654
CoRe [32]             0.9341     0.365
TSA-Net [27]          0.9422     -
Ours                  0.9451     0.3222

Method (w/ DD)        Sp. Corr   R-ℓ2 (×100)
USDL [22]             0.9231     0.468
MUSDL [22]            0.9273     0.451
CoRe [32]             0.9512     0.260
Ours                  0.9607     0.2378…”
Section: Comparison to State-of-the-art (mentioning)
confidence: 99%
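
As a toy illustration of the 'w/ DD' computation described in the statement above (all numbers hypothetical, not taken from MTL-AQA):

```python
# Toy 'w/ DD' example (hypothetical numbers): the final quality score is
# the raw execution score multiplied by the dive's difficulty degree (DD).
raw_score = 8.5          # hypothetical judged execution score
difficulty_degree = 3.2  # hypothetical DD label from the dataset
final_score = raw_score * difficulty_degree
print(final_score)       # 27.2
```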
“…Fall Recognition in Figure Skating (FR-FS) [11]. Existing AQA datasets of figure skating [5], [10] only contain long videos, in which detailed information is drowned out, ultimately degrading the evaluation performance of the AQA model.…”
Section: UNLV Dive and UNLV Vault (mentioning)
confidence: 99%
“…Existing AQA datasets of figure skating [5], [10] only contain long videos, in which detailed information is drowned out, ultimately degrading the evaluation performance of the AQA model. To solve this problem, Wang et al. [11] proposed FR-FS to recognize falls in figure skating, planning to gradually build a fine-grained AQA system. Videos in FR-FS contain the athlete's take-off, rotation, and landing movements.…”
Section: UNLV Dive and UNLV Vault (mentioning)
confidence: 99%