2019
DOI: 10.1109/access.2019.2942000

Video Captioning With Adaptive Attention and Mixed Loss Optimization

Abstract: The attention mechanism and sequence-to-sequence framework have shown promising advancements in the temporal task of video captioning. However, imposing the attention mechanism on non-visual words, such as "of" and "the", may mislead the decoder and decrease the overall performance of video captioning. Furthermore, the traditional sequence-to-sequence framework optimizes the model using a word-level cross-entropy loss, which results in an exposure bias problem. This problem occurs because, at test time, the decoder must condition on its own previously generated words rather than the ground-truth words it saw during training, so early mistakes compound over the sequence. …
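The abstract's first point, down-weighting visual attention on non-visual function words, is commonly realized with a visual-sentinel style adaptive attention (in the spirit of Lu et al., CVPR 2017). Below is a minimal PyTorch sketch of that idea; all module names, dimensions, and the sentinel construction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptiveAttention(nn.Module):
    """Minimal adaptive attention with a visual sentinel (illustrative).

    Attention is computed over the T frame features plus one extra
    "sentinel" slot; the weight on that slot acts as a gate that lets
    non-visual words ("of", "the") ignore the video entirely.
    """

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project frame features
        self.sent_proj = nn.Linear(hidden_dim, attn_dim)    # project sentinel
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)     # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar attention score
        self.feat_to_hid = nn.Linear(feat_dim, hidden_dim)  # match context to hidden size

    def forward(self, feats, h_t, s_t):
        # feats: (B, T, feat_dim)  frame features from a CNN encoder
        # h_t:   (B, hidden_dim)   current decoder LSTM hidden state
        # s_t:   (B, hidden_dim)   visual sentinel, e.g. g_t * tanh(c_t)
        keys = torch.tanh(self.feat_proj(feats) + self.hid_proj(h_t).unsqueeze(1))    # (B, T, attn_dim)
        sent_key = torch.tanh(self.sent_proj(s_t) + self.hid_proj(h_t)).unsqueeze(1)  # (B, 1, attn_dim)
        logits = self.score(torch.cat([keys, sent_key], dim=1)).squeeze(-1)           # (B, T+1)
        alpha = torch.softmax(logits, dim=-1)              # attention incl. sentinel slot
        beta = alpha[:, -1:]                               # (B, 1) weight on the sentinel
        visual_ctx = (alpha[:, :-1].unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
        ctx = self.feat_to_hid(visual_ctx) + beta * s_t    # adaptive mixture
        return ctx, alpha

# Tiny smoke test with random tensors.
if __name__ == "__main__":
    attn = AdaptiveAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
    feats = torch.randn(4, 20, 2048)       # 20 frames of CNN features
    h, c = torch.randn(4, 512), torch.randn(4, 512)
    g = torch.sigmoid(torch.randn(4, 512))
    s = g * torch.tanh(c)                  # sentinel derived from LSTM memory
    ctx, alpha = attn(feats, h, s)
    print(ctx.shape, alpha.shape)          # torch.Size([4, 512]) torch.Size([4, 21])
```

The weight assigned to the extra sentinel slot plays the role of a gate: when it is close to 1, the next word is generated from the language state alone rather than from the video frames.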

Cited by 9 publications (4 citation statements) · References 36 publications (61 reference statements)
“…For example, methods [4], [16] extract visual features from a video using CNNs, encode the video features, and then use LSTMs to decode them into natural language. To further improve captioning quality, subsequent work extends the encoder-decoder structure by consolidating a temporal attention mechanism [5], hierarchical RNNs [6], [8], LSTMs with visual-semantic embedding [7], semantic decoders [9], [10], [11], spatial hard attention [31], a reconstruction network [12], and reinforced adaptive attention [32]. Despite these efforts, these methods are limited to generating a single sentence from a trimmed video.…”
Section: Related Work, A. Video Captioning Tasks, 1) Video Captioning (mentioning)
confidence: 99%
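The excerpt above summarizes the standard CNN-plus-LSTM encoder-decoder pipeline of [4], [16]. A minimal sketch of that baseline, assuming pre-extracted CNN frame features and teacher forcing, might look as follows; all names and hyper-parameters are illustrative, not taken from those papers.

```python
import torch
import torch.nn as nn

class CNNLSTMCaptioner(nn.Module):
    """Minimal encoder-decoder captioner: CNN frame features ->
    LSTM encoder -> LSTM decoder over words (illustrative sketch).
    """

    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, embed=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, captions):
        # feats:    (B, T, feat_dim) pre-extracted CNN features, one row per frame
        # captions: (B, L) ground-truth word ids (teacher forcing)
        _, (h, c) = self.encoder(feats)          # summarize the video
        emb = self.embed(captions[:, :-1])       # shift right: predict the next word
        dec_out, _ = self.decoder(emb, (h, c))   # condition the decoder on the video state
        return self.out(dec_out)                 # (B, L-1, vocab) logits

if __name__ == "__main__":
    model = CNNLSTMCaptioner()
    feats = torch.randn(2, 20, 2048)
    caps = torch.randint(0, 10000, (2, 12))
    print(model(feats, caps).shape)  # torch.Size([2, 11, 10000])
```

The extensions listed in the excerpt (temporal attention, hierarchical RNNs, semantic decoders, and so on) all replace or augment pieces of this skeleton rather than change its overall encode-then-decode shape.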
“…For example, basic frameworks [4], [13] capture visual features from a video using CNNs, then encode the video features and decode them into natural language using LSTMs. To improve captioning quality, subsequent work has extended the basic encoder-decoder structure by incorporating a temporal attention mechanism [5], hierarchical RNNs [6], [8], LSTMs with visual-semantic embedding [7], a semantic decoder [9]-[11], a reconstruction network [12], spatial hard attention [14], and reinforced adaptive attention [15]. Despite these efforts, such methods are limited to generating a single sentence from a video.…”
Section: A. Video Captioning (mentioning)
confidence: 99%
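The "reinforced adaptive attention" line of work cited above, including the surveyed paper's mixed loss optimization, combines word-level cross entropy with a sentence-level reinforcement objective, typically in the self-critical style where greedy decoding provides the reward baseline. A hedged sketch follows, assuming a CIDEr-like sentence reward and an illustrative mixing weight `gamma`; none of these names come from the paper itself.

```python
import torch
import torch.nn.functional as F

def mixed_loss(logits, targets, sample_logprobs, sample_reward, greedy_reward, gamma=0.7):
    """Sketch of a mixed loss: (1 - gamma) * cross entropy + gamma * self-critical RL.

    - logits: (B, L, V) teacher-forced logits; targets: (B, L) word ids.
    - sample_logprobs: (B,) summed log-probs of captions sampled from the model.
    - sample_reward / greedy_reward: (B,) sentence-level scores (e.g. CIDEr) for
      the sampled and greedily decoded captions; greedy decoding is the baseline.
    The weight gamma is illustrative; the paper's exact schedule may differ.
    """
    xe = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    advantage = sample_reward - greedy_reward              # self-critical baseline
    rl = -(advantage.detach() * sample_logprobs).mean()    # REINFORCE with baseline
    return (1.0 - gamma) * xe + gamma * rl

if __name__ == "__main__":
    B, L, V = 2, 11, 10000
    logits = torch.randn(B, L, V, requires_grad=True)
    targets = torch.randint(0, V, (B, L))
    logp = torch.randn(B, requires_grad=True)
    loss = mixed_loss(logits, targets, logp,
                      sample_reward=torch.tensor([0.8, 0.5]),
                      greedy_reward=torch.tensor([0.6, 0.6]))
    loss.backward()
    print(float(loss))
```

With the greedy caption's reward as the baseline, sampled captions that beat greedy decoding are reinforced and worse ones are suppressed, which directly optimizes the sentence-level metric and mitigates the exposure bias of pure cross-entropy training.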
“…At that point, the visual and semantic streams together produced a meaningful textual description of a visual scene. Another approach, incorporating adaptive attention and mixed loss optimization for video captioning, was presented by Xiao et al. [21]. The technique included a reinforced adaptive attention mechanism.…”
Section: End-to-End Model (mentioning)
confidence: 99%