2016 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2016.7727435

Exploring multimodal video representation for action recognition

Abstract: A video contains rich perceptual information, such as visual appearance, motion, and audio, which can be used for understanding the activities in videos. Recent works have shown that the combination of appearance (spatial) and motion (temporal) cues can significantly improve human action recognition performance in videos. To further explore the multimodal representation of video in action recognition, we propose a framework to learn a multimodal representation from video appearance and motion as well as audio data. C…
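The abstract is truncated in this record, but its gist is a three-stream (appearance, motion, audio) representation learned jointly. As a rough illustration only, here is a minimal PyTorch sketch of concatenation-based multimodal fusion; the class name, feature dimensions, and fusion-by-concatenation design are assumptions for the sketch, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusionNet(nn.Module):
    """Illustrative three-stream model: appearance, motion, and audio features
    are projected to a shared size, concatenated, and classified jointly."""
    def __init__(self, app_dim=2048, mot_dim=2048, aud_dim=128,
                 hidden_dim=512, num_classes=101):
        super().__init__()
        # Per-modality projection heads; all dimensions are placeholders.
        self.app_fc = nn.Linear(app_dim, hidden_dim)
        self.mot_fc = nn.Linear(mot_dim, hidden_dim)
        self.aud_fc = nn.Linear(aud_dim, hidden_dim)
        # Joint classifier over the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, app_feat, mot_feat, aud_feat):
        fused = torch.cat([self.app_fc(app_feat),
                           self.mot_fc(mot_feat),
                           self.aud_fc(aud_feat)], dim=-1)
        return self.classifier(fused)

# Example: a batch of 8 clips with pre-extracted per-modality features.
model = MultimodalFusionNet()
logits = model(torch.randn(8, 2048), torch.randn(8, 2048), torch.randn(8, 128))
```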

Cited by 12 publications (14 citation statements) | References 24 publications

Citation statements (ordered by relevance):
“…The authors of this study address a key limitation of the deconvolution layer, which suffers from checkerboard artifacts; the neural network is used as the basis of a semi-supervised annotation method. Zheng et al. [30] proposed MMDF-LDA, an improved multimodal latent Dirichlet allocation model for social image annotation. The authors focus on developing a data fusion model for social image annotation.…”
Section: Semi-supervised Methods (mentioning)
confidence: 99%
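For context on the checkerboard problem mentioned in the statement above: strided transposed convolutions can produce checkerboard artifacts, and a common workaround is to upsample first and then apply an ordinary convolution. A minimal PyTorch sketch of that workaround follows; the block structure is illustrative and not taken from the cited study.

```python
import torch.nn as nn

def upsample_conv(in_ch, out_ch):
    """Upsample-then-convolve block, a common alternative to a strided
    ConvTranspose2d that tends to avoid checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    )
```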
“…Wang et al. [30] work on retrieving the perceptual information present in videos. The information was used for human action recognition, to which spatial-temporal constraints have made a significant contribution.…”
Section: Semi-supervised Methods (mentioning)
confidence: 99%
“…The audio recording that accompanies the visual stream provides information complementary to appearance and motion; for example, specific actions may be characterized by their unique sounds. Combining these two modalities within a deep learning pipeline at either the data level or the feature level has been thoroughly studied in many works [137,105,50,69], benefiting from the fact that video cameras typically provide both visual and audio streams simultaneously. Because video and audio data are largely heterogeneous, most fusion methods perform either feature fusion or score fusion.…”
Section: Video and Audio (mentioning)
confidence: 99%
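The feature-fusion vs. score-fusion distinction in this statement can be made concrete with a small sketch. Assuming per-modality features and per-modality class logits are already computed, early (feature) fusion concatenates features before a single classifier, while late (score) fusion averages per-modality scores. Everything below is an illustrative assumption, not code from the cited works.

```python
import torch

def feature_fusion(classifier, feats):
    """Early fusion: concatenate per-modality features, classify once."""
    return classifier(torch.cat(feats, dim=-1))

def score_fusion(per_modality_logits, weights=None):
    """Late fusion: (optionally weighted) average of per-modality class scores."""
    stacked = torch.stack(per_modality_logits)  # (num_modalities, batch, classes)
    if weights is None:
        return stacked.mean(dim=0)
    w = torch.tensor(weights, dtype=stacked.dtype).view(-1, 1, 1)
    return (w * stacked).sum(dim=0) / w.sum()
```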
“…Because video and audio data are largely heterogeneous, most fusion methods perform either feature fusion or score fusion. A three-pathway network was proposed to combine RGB frames, optical flow, and the audio signal, demonstrating that simple feature fusion performed better than late score fusion [137]. A two-stream CNN was trained in a self-supervised manner to capture the temporal alignment between audio and video frames [105].…”
Section: Video and Audio (mentioning)
confidence: 99%
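The self-supervised alignment objective attributed to [105] can be sketched as an audio-visual correspondence task: a small head scores whether a visual embedding and an audio embedding come from the same moment, trained with misaligned pairs as negatives. The head design and the roll-based negative sampling below are assumptions for illustration, not the cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Toy audio-visual correspondence head: scores whether a visual embedding
    and an audio embedding come from the same moment in the video."""
    def __init__(self, vis_dim=512, aud_dim=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # single logit: aligned vs. misaligned
        )

    def forward(self, vis_emb, aud_emb):
        return self.scorer(torch.cat([vis_emb, aud_emb], dim=-1)).squeeze(-1)

def alignment_loss(head, vis_emb, aud_emb):
    # Positives: matching (visual, audio) pairs; negatives: audio embeddings
    # rolled by one batch position so each visual clip sees mismatched audio.
    pos = head(vis_emb, aud_emb)
    neg = head(vis_emb, aud_emb.roll(shifts=1, dims=0))
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```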