Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning

Yudistira, Novanto; Kurita, Takio

doi:10.1186/s13640-017-0235-9

Cited by 26 publications

(16 citation statements)

References 16 publications

“…This is consistent with results in other NLP tasks. As for image encoders, VGGNet achieves higher scores than ResNet, which is often observed in multimodal tasks (Wang et al., 2017; Ouyang et al., 2017; Yudistira and Kurita, 2017). BERT × VGGNet using all the input modalities achieves the highest R10@1 score of 53.6%.…”

Section: Quantitative Results (mentioning)

confidence: 98%

A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses

Kamezawa, Nishida, Sato, et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)


In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents' verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.

“…Deep learning based human action recognition solutions is proliferating with an added advantage one over another. Multistream deep architectures [8] [9] have surpassed the performances single stream deep state-of-the-arts [1] [16] due to the fact that such architectures are enriched with fusion of different types of action cues: temporal, motion, and spatial. The motion between frames is majorly defined as optical flow [16].…”

Section: Related Work (mentioning)

confidence: 99%

“…The mainstream literature listed above [5] [8] [9] [17] targeted action recognition from a common viewpoint. Such frameworks fail to produce a good performance for different viewpoint test samples.…”

Section: Related Work (mentioning)

confidence: 99%

View-Invariant Deep Architecture for Human Action Recognition Using Two-Stream Motion and Shape Temporal Dynamics

Dhiman, Vishwakarma 2020

IEEE Trans. on Image Process.

Human action recognition for unknown views is a challenging task. We propose a view-invariant deep human action recognition framework, which is a novel integration of two important action cues: motion and shape temporal dynamics (STD). The motion stream encapsulates the motion content of action as RGB Dynamic Images (RGB-DIs), which are processed by the fine-tuned InceptionV3 model. The STD stream learns long-term view-invariant shape dynamics of action using human pose model (HPM) based view-invariant features mined from structural similarity index matrix (SSIM) based key depth human pose frames. To predict the score of the test sample, three types of late fusion (maximum, average and product) are applied to the individual stream scores. To validate the performance of the proposed framework, experiments are performed using both cross-subject and cross-view validation schemes on three publicly available benchmarks: the NUCLA multi-view dataset, the UWA3D-II Activity dataset and the NTU RGB-D Activity dataset. Our algorithm significantly outperforms existing state-of-the-art methods, as reported in terms of accuracy, receiver operating characteristic (ROC) curve and area under the curve (AUC).
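
The three late-fusion rules named in this abstract (maximum, average and product) can be illustrated with a minimal sketch. It assumes each stream outputs one score per class and that the class with the highest fused score is predicted. The function names `late_fusion` and `predict` and the example scores are hypothetical and are not taken from the cited paper.

```python
from math import prod


def late_fusion(stream_scores, rule="average"):
    """Fuse per-class scores from several streams into one score per class.

    stream_scores: list of equal-length score lists, one list per stream
    (an assumed layout, e.g. one list for motion and one for shape).
    rule: "maximum", "average" or "product".
    """
    if not stream_scores:
        raise ValueError("at least one stream is required")
    n_classes = len(stream_scores[0])
    if any(len(s) != n_classes for s in stream_scores):
        raise ValueError("all streams must score the same number of classes")

    # Regroup so that each tuple holds every stream's score for one class.
    per_class = list(zip(*stream_scores))

    if rule == "maximum":
        return [max(c) for c in per_class]
    if rule == "average":
        return [sum(c) / len(c) for c in per_class]
    if rule == "product":
        return [prod(c) for c in per_class]
    raise ValueError(f"unknown fusion rule: {rule}")


def predict(stream_scores, rule="average"):
    """Return the index of the class with the highest fused score."""
    fused = late_fusion(stream_scores, rule)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # Made-up scores for three classes from two streams.
    motion = [0.5, 0.4, 0.1]
    shape = [0.1, 0.6, 0.3]
    for rule in ("maximum", "average", "product"):
        print(rule, late_fusion([motion, shape], rule), predict([motion, shape], rule))
```

Because fusion here operates only on the final scores, each stream can be trained and evaluated independently, and the fusion rule can be swapped without retraining.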

“…Human behavior recognition is one of the growing research topics in computer vision and pattern recognition. Human behavior recognition is usually applied in machine learning to monitoring human activities and getting insight from them [1]. The behavioral examination can help solve many problems in indoor as well as outdoor surveillance systems.…”

Section: Introduction (mentioning)

confidence: 99%

Facial Expression Recognition using Residual Convnet with Image Augmentations

Rahadika, Yudistira, Sari 2021

Jurnal Ilmu Komputer dan Informasi

Self Cite

During the COVID-19 pandemic, many offline activities were moved online via video meetings to prevent the spread of the COVID-19 virus. In online video meetings, some micro-interactions are missing compared to direct social interactions. The use of machines to assist facial expression recognition in online video meetings is expected to increase understanding of the interactions among users. Many studies have shown that CNN-based neural networks are quite effective and accurate in image classification. In this study, several open facial expression datasets, totaling 342,497 training images, were used to train CNN-based neural networks. The best results were obtained with a ResNet-50 architecture with the Mish activation function and the Accuracy Booster Plus block. This architecture was trained using the Ranger and Gradient Centralization optimization method for 60,000 steps with a batch size of 256. The best model reaches an accuracy of 0.5972 on AffectNet validation data, 0.8636 on FERPlus validation data, 0.8488 on FERPlus test data, and 0.8879 on RAF-DB test data. The proposed method outperformed plain ResNet in all test scenarios without transfer learning, and there is potential for better performance with a pre-trained model. The code is available at https://github.com/yusufrahadika-facial-expressions-essay.
