2020
DOI: 10.1016/j.neucom.2020.08.035
Video captioning with boundary-aware hierarchical language decoding and joint video prediction

Cited by 15 publications (3 citation statements)
References 23 publications
“…Hierarchical encoder structures are also proposed by [30] and [31] that give more attention to the temporal details of the video. Descriptions are made by employing attention in the decoder section [15], [16] as well as multimodal fusion mechanisms with aural features in the video [32]. A multimodal temporal attention mechanism incorporating image, motion, and audio features is given in [33].…”
Section: Literature Survey
Confidence: 99%
“…Later, attention mechanisms were included in the spatial as well as the temporal domain to achieve better performance [13], [14]. Video descriptions can also be generated by employing attention in the decoder section as well as using multimodal fusion mechanisms of visual, text, and audio features [15], [16].…”
Section: Introduction
Confidence: 99%
“…State-of-the-art and limitations: The pursuit of multimodal-input-based abstractive text summarization can be related to various other fields, such as image and video captioning [22,34,39,48,49], video story generation [16], video title generation [57], and multimodal sentence summarization [28]. However, these works generally produce summaries based on either images or short videos, and the target summaries are easier to predict due to the limited vocabulary diversity.…”
Section: Introduction
Confidence: 99%