Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413498
Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning

Cited by 19 publications (14 citation statements)
References 23 publications
“…Compared to CNNs, the attention mechanism learns more global dependencies; therefore, the transformer also shows strong performance in low-level tasks [3]. The transformer has also proved effective in the multi-modal area, including multi-modal representations [45] and applications [13,19,31]. Inspired by the extensive applications of the transformer, we integrate a transformer encoder-decoder into the document image rectification problem.…”
Section: Transformer in Language and Vision
confidence: 99%
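As a rough illustration of the idea in the excerpt above (a hypothetical sketch, not the cited paper's model), a transformer encoder-decoder of this kind can be instantiated directly with PyTorch's nn.Transformer; the token shapes and the per-token output head below are assumptions standing in for flattened image-patch embeddings and rectification targets.

```python
# Minimal sketch (assumption: PyTorch) of a generic transformer
# encoder-decoder, as the excerpt applies to document image rectification.
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, 2)  # hypothetical per-token 2D output

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: (batch, src_len, d_model) flattened image-patch embeddings
        # tgt_tokens: (batch, tgt_len, d_model) query embeddings for the output grid
        hidden = self.transformer(src_tokens, tgt_tokens)
        return self.head(hidden)

# Usage with random tensors, just to show the expected shapes.
model = EncoderDecoderSketch()
src = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens
tgt = torch.randn(2, 196, 256)
out = model(src, tgt)            # (2, 196, 2)
```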
“…Recent research [3,33,38,42,47] mainly focuses on modeling the relationship between fixed video representations and the output textual descriptions via an encoder-decoder framework for video captioning. Specifically, these methods [12,27,30,33,59] employ an encoder to refine video representations from a set of fixed video frame features, and a language decoder operates on top of these refined video representations to learn visual-textual alignment for caption generation.…”
Section: Related Work
confidence: 99%
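The encoder-decoder captioning pattern described in this excerpt can be sketched as follows (a minimal, hypothetical illustration assuming PyTorch; the feature dimension, layer counts, and vocabulary size are placeholders): an encoder refines pre-extracted frame features, and a text decoder attends to them to produce caption tokens.

```python
# Minimal sketch (assumption: PyTorch) of an encoder-decoder video captioner
# operating on offline-extracted frame features, as the excerpt describes.
import torch
import torch.nn as nn

class CaptioningSketch(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)  # map frame features to model width
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, caption_ids):
        # frame_feats: (batch, num_frames, feat_dim) fixed, pre-extracted features
        # caption_ids: (batch, seq_len) token ids of the shifted caption
        memory = self.encoder(self.proj(frame_feats))      # refined video tokens
        tgt = self.embed(caption_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(caption_ids.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=mask)  # visual-textual alignment
        return self.lm_head(hidden)                        # next-token logits

feats = torch.randn(2, 32, 2048)
tokens = torch.randint(0, 10000, (2, 12))
logits = CaptioningSketch()(feats, tokens)   # (2, 12, 10000)
```

Training would typically apply a cross-entropy loss between these logits and the next caption token, which is the standard recipe for this family of methods.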
“…Video captioning [3,12,27,30,38,42,47,48,59] is the task of describing the visual content of a given video in natural language. As such, it requires an algorithm to understand … Different from prior works that use offline-extracted 2D/3D features, we propose to adopt the video transformer as our video encoder, and present an end-to-end fully Transformer-based model for video captioning.…”
Section: Introduction
confidence: 99%
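The contrast drawn in this excerpt, raw pixels fed to a trainable video encoder rather than frozen offline 2D/3D features, can be sketched roughly as below (a hypothetical illustration, not the cited model; the tubelet size and layer counts are assumptions): embedding pixel patches directly lets gradients from the caption loss reach the visual backbone.

```python
# Rough sketch (assumption: PyTorch) of an end-to-end video encoder that
# tokenizes raw pixels instead of consuming offline-extracted features.
import torch
import torch.nn as nn

class PixelPatchEncoder(nn.Module):
    def __init__(self, d_model=512, patch=16):
        super().__init__()
        # Tubelet-style embedding: 2 frames x 16x16 pixels per token.
        self.to_tokens = nn.Conv3d(3, d_model, kernel_size=(2, patch, patch),
                                   stride=(2, patch, patch))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video):
        # video: (batch, 3, num_frames, height, width) raw pixels
        tokens = self.to_tokens(video).flatten(2).transpose(1, 2)  # (B, N, d_model)
        return self.encoder(tokens)  # trainable jointly with a caption decoder

clip = torch.randn(1, 3, 8, 224, 224)
video_tokens = PixelPatchEncoder()(clip)   # (1, 4 * 14 * 14, 512) = (1, 784, 512)
```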
“…Although retrieval-based methods can find human-like sentences with semantics similar to the video, it is challenging to generate an entirely correct description due to the limited retrieval samples. With the advent of the encoder-decoder framework, most current work studies how to better use visual features [35,23,1,36,22,24] and designs elaborate models [34,2,28,13] to generate sentences directly. However, the diversity and controllability of sentences generated in this way are not satisfactory.…”
Section: Related Work
confidence: 99%
“…Video captioning is one of the most important vision-language tasks; it seeks to automatically describe what has happened in a video according to its visual content. Recently, many promising methods [36,22,24,34,2] have been proposed to address this task. These methods mainly focus on learning spatial-temporal representations of videos to fully exploit visual information and on devising novel decoders to achieve visual-textual alignment or controllable decoding.…”
Section: Introduction
confidence: 99%