2020
DOI: 10.3389/frobt.2020.475767

A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling

Abstract: Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but during training and inference, different strateg…
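The scheduled-sampling idea the title refers to, which the abstract contrasts with Teacher Forcing, can be summarized in a few lines. The sketch below is a minimal illustration assuming the inverse-sigmoid decay of Bengio et al. (2015); the function names and the constant k are illustrative, not taken from the paper:

```python
import math
import random

def teacher_forcing_prob(epoch, k=10.0):
    # Inverse-sigmoid decay: close to 1 early in training (mostly
    # ground-truth inputs), decaying toward 0 (mostly model predictions).
    return k / (k + math.exp(epoch / k))

def next_decoder_input(gold_token, predicted_token, epoch):
    # With probability eps feed the ground-truth token (teacher forcing);
    # otherwise feed the model's own previous prediction, as at inference.
    eps = teacher_forcing_prob(epoch)
    return gold_token if random.random() < eps else predicted_token
```

Choosing the decoder input this way at every training timestep gradually moves the train-time input distribution toward the inference-time one, which is the mismatch the abstract points out.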

Cited by 33 publications (18 citation statements)
References 42 publications (87 reference statements)

Citation statements, ordered by relevance:
“…2 shows the proposed approach's performance against other state-of-the-art methods on the MSVD dataset. The SCN-LSTM [6] and SAVCSS [2] methods process a semantic representation with visual-semantic compositional LSTM decoders, without considering syntactic information. Incorporating a syntactic representation into our compositional modules improves performance over those approaches.…”
Section: Experiments and Results (mentioning)
confidence: 99%
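For readers unfamiliar with the "visual-semantic compositional" decoders mentioned above: in the SCN family, each decoder weight matrix is composed as a function of the semantic tag-probability vector, usually via a low-rank factorization. A minimal sketch, where the class name, dimensions, and rank are assumptions rather than the cited implementation:

```python
import torch
import torch.nn as nn

class SemanticComposedLinear(nn.Module):
    # Effective transform (Wa · diag(Wb s) · Wc) x: the weights applied
    # to the input x are modulated by the semantic tag-probability vector s.
    def __init__(self, in_dim, out_dim, sem_dim, rank=256):
        super().__init__()
        self.Wa = nn.Linear(rank, out_dim, bias=False)
        self.Wb = nn.Linear(sem_dim, rank, bias=False)
        self.Wc = nn.Linear(in_dim, rank, bias=False)

    def forward(self, x, s):
        # Elementwise product implements diag(Wb s) without ever
        # materializing the full composed weight matrix.
        return self.Wa(self.Wb(s) * self.Wc(x))
```

An LSTM decoder built from such layers makes every gate computation depend on the detected semantics; the syntactic variants discussed in these citation statements additionally condition on POS-tag information.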
“…Likewise, the superior performance of our sentence generator framework is demonstrated in comparison to models that exploit fixed encoding based on 2D-CNN and 3D-CNN features, such as LSTM-E, SCN-LSTM, and SAVCSS. Two recent approaches [10,22] use the syntactic information from the POS tagging structure but …

MSVD results (B-4, METEOR, CIDEr, ROUGE-L; "-" = not reported):

Method                 B-4    METEOR  CIDEr   ROUGE-L
LSTM-E                 45.3   31.0    -       -
SCN-LSTM [6]           51.1   33.5    77.7    -
TDDF [25]              45.8   33.3    73.0    69.7
MTVC [16]              54.5   36.0    92.4    72.8
BAE [1]                42.5   32.4    63.5    -
MFATT-TM-SP [13]       52.0   33.5    -       -
ECO [27]               53.5   35.0    85.8    -
SibNet [12]            54.2   34.8    88.2    71.7
Joint-VisualPOS [10]   52.8   36.1    87.8    71.5
GFN-POS RL(IR+M) [22]  53.9   34.9    91.0    72.1
hLSTMat [7]            54.3   33.9    73.8    -
SAVCSS [2]             61.8   37.8    103.0   76.8
DSD-3 DS-SEM [9]       50.1   34.7    76.0    73.1
ORG-TRL [26]           54.3   36.4    95.2    73.9
SemSynAN (ours)        64.4   41.9    111.5   79.5…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…The encoder of the proposed model consists of two parts: the first computes 2D-CNN and 3D-CNN feature vectors, and the second acts as a 'concept detector', following the work of Chen et al. [15] and Gan et al. [31].…”
Section: Feature Extraction (mentioning)
confidence: 99%
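A concept detector of this kind is typically a multi-label classifier over the most frequent caption words, trained with binary cross-entropy against multi-hot targets. A minimal sketch, with layer sizes and names as assumptions rather than the cited architecture:

```python
import torch
import torch.nn as nn

class ConceptDetector(nn.Module):
    # Maps pooled video features to independent probabilities over the
    # K most frequent caption words ("concepts" or "tags").
    def __init__(self, feat_dim, num_concepts, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_concepts),
        )

    def forward(self, video_feats):              # (B, T, feat_dim)
        pooled = video_feats.mean(dim=1)          # mean-pool over frames
        return torch.sigmoid(self.net(pooled))    # (B, num_concepts)

# Training: targets are multi-hot vectors marking which concepts appear
# in the reference captions, e.g. loss = nn.BCELoss()(probs, targets).
```

The resulting probability vector is what the compositional decoders above consume as their semantic representation.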
“…Xu et al. [138] have proposed a model that addresses dense video captioning with a 'Hierarchical Captioning Module', which generates captions for videos with a controller LSTM. Building on the SCN-LSTM [15,31], Martin et al. [89] have proposed a new compositional model that works as the decoder. It incorporates several functionalities that were absent in the original model, such as a visual-dependent layer, temporal attention, a semantic-dependent layer, an adaptive attention gate, and word embedding.…”
Section: Sentence Generation (mentioning)
confidence: 99%
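Among the decoder components listed in that statement, temporal attention is the most standard: each decoding step scores the frame features against the current hidden state and takes their softmax-weighted sum as the visual context. A minimal sketch of the idea, with shapes and names that are illustrative rather than the cited implementation:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Additive (Bahdanau-style) attention over the temporal axis of
    # a sequence of per-frame video features.
    def __init__(self, feat_dim, hid_dim, attn_dim=256):
        super().__init__()
        self.proj_f = nn.Linear(feat_dim, attn_dim, bias=False)
        self.proj_h = nn.Linear(hid_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, feats, h):   # feats: (B, T, F), h: (B, H)
        # Score every frame against the current decoder state.
        e = self.score(torch.tanh(self.proj_f(feats)
                                  + self.proj_h(h).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)
        # Weighted sum over time gives the visual context for this step.
        return (alpha * feats).sum(dim=1)  # (B, F)
```

The other listed components (the visual- and semantic-dependent layers and the adaptive attention gate) decide, per timestep, how much the generated word should rely on this visual context versus the language model itself.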