2016 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2016.7727435

Exploring multimodal video representation for action recognition

Abstract: A video contains rich perceptual information, such as visual appearance, motion, and audio, which can be used for understanding the activities in videos. Recent works have shown that the combination of appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. To further explore the multimodal representation of video in action recognition, we propose a framework to learn a multimodal representation from video appearance and motion as well as audio data. C…
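The abstract is truncated in this record, but its gist is a three-stream (appearance, motion, audio) representation learned jointly. As a rough illustration only, here is a minimal PyTorch sketch of concatenation-based multimodal fusion; the class name, feature dimensions, and fusion-by-concatenation design are assumptions for the sketch, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusionNet(nn.Module):
    """Illustrative three-stream model: appearance, motion, and audio features
    are projected to a shared size, concatenated, and classified jointly."""
    def __init__(self, app_dim=2048, mot_dim=2048, aud_dim=128,
                 hidden_dim=512, num_classes=101):
        super().__init__()
        # Per-modality projection heads; all dimensions are placeholders.
        self.app_fc = nn.Linear(app_dim, hidden_dim)
        self.mot_fc = nn.Linear(mot_dim, hidden_dim)
        self.aud_fc = nn.Linear(aud_dim, hidden_dim)
        # Joint classifier over the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, app_feat, mot_feat, aud_feat):
        fused = torch.cat([self.app_fc(app_feat),
                           self.mot_fc(mot_feat),
                           self.aud_fc(aud_feat)], dim=-1)
        return self.classifier(fused)

# Example: a batch of 8 clips with pre-extracted per-modality features.
model = MultimodalFusionNet()
logits = model(torch.randn(8, 2048), torch.randn(8, 2048), torch.randn(8, 128))
```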

Cited by 12 publications (14 citation statements) | References 24 publications

Citation statements (ordered by relevance):
“…The authors of this study address a key limitation of the deconvolution layer, which suffers from checkerboard artifacts; the neural network is used as the basis of a semi-supervised annotation method. Zheng et al. [30] proposed MMDF-LDA, an improved multimodal latent Dirichlet allocation model for social image annotation. The authors focus on developing a data fusion model for social image annotation.…”
Section: Semi-supervised Methods (mentioning)
confidence: 99%
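For context on the checkerboard problem mentioned in the statement above: strided transposed convolutions can produce checkerboard artifacts, and a common workaround is to upsample first and then apply an ordinary convolution. A minimal PyTorch sketch of that workaround follows; the block structure is illustrative and not taken from the cited study.

```python
import torch.nn as nn

def upsample_conv(in_ch, out_ch):
    """Upsample-then-convolve block, a common alternative to a strided
    ConvTranspose2d that tends to avoid checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    )
```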
“…Wang et al. [30] work on retrieving the perceptual information present in videos. The information was used for human action recognition, to which spatial-temporal constraints have made a significant contribution.…”
Section: Semi-supervised Methods (mentioning)
confidence: 99%
“…The audio recording that accompanies the visual stream provides information complementary to appearance and motion; for example, specific actions may be characterized by their unique sounds. Combining these two modalities within a deep learning pipeline at either the data level or the feature level has been thoroughly studied in many works [137,105,50,69], benefiting from the fact that video cameras typically provide both visual and audio streams simultaneously. Because video and audio data are largely heterogeneous, most fusion methods perform either feature fusion or score fusion.…”
Section: Video and Audio (mentioning)
confidence: 99%
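The feature-fusion vs. score-fusion distinction in this statement can be made concrete with a small sketch. Assuming per-modality features and per-modality class logits are already computed, early (feature) fusion concatenates features before a single classifier, while late (score) fusion averages per-modality scores. Everything below is an illustrative assumption, not code from the cited works.

```python
import torch

def feature_fusion(classifier, feats):
    """Early fusion: concatenate per-modality features, classify once."""
    return classifier(torch.cat(feats, dim=-1))

def score_fusion(per_modality_logits, weights=None):
    """Late fusion: (optionally weighted) average of per-modality class scores."""
    stacked = torch.stack(per_modality_logits)  # (num_modalities, batch, classes)
    if weights is None:
        return stacked.mean(dim=0)
    w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1, 1)
    return (w * stacked).sum(dim=0) / w.sum()
```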
“…Because video and audio data are largely heterogeneous, most fusion methods perform either feature fusion or score fusion. A three-pathway network was proposed to combine RGB frames, optical flow, and the audio signal, demonstrating that simple feature fusion performed better than late score fusion [137]. A two-stream CNN was trained in a self-supervised manner to capture the temporal alignment between audio and video frames [105].…”
Section: Video and Audio (mentioning)
confidence: 99%
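The self-supervised alignment objective attributed to [105] can be sketched as an audio-visual correspondence task: a small head scores whether a visual embedding and an audio embedding come from the same moment, trained with misaligned pairs as negatives. The head design and the roll-based negative sampling below are assumptions for illustration, not the cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Toy audio-visual correspondence head: scores whether a visual embedding
    and an audio embedding come from the same moment in the video."""
    def __init__(self, vis_dim=512, aud_dim=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # single logit: aligned vs. misaligned
        )

    def forward(self, vis_emb, aud_emb):
        return self.scorer(torch.cat([vis_emb, aud_emb], dim=-1)).squeeze(-1)

def alignment_loss(head, vis_emb, aud_emb):
    # Positives: matching (visual, audio) pairs; negatives: audio embeddings
    # rolled by one batch position so each visual clip sees mismatched audio.
    pos = head(vis_emb, aud_emb)
    neg = head(vis_emb, aud_emb.roll(shifts=1, dims=0))
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```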