Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1039

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Abstract: Instructional videos get high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR…
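
As a concrete illustration of the input fusion the abstract describes, below is a minimal, hypothetical sketch of feeding ASR token embeddings and per-frame visual features to a single transformer encoder. The dimensions, vocabulary size, and module names are assumptions for illustration, not the paper's actual model.

```python
# A minimal sketch (not the paper's exact architecture) of fusing ASR token
# embeddings with per-frame visual features as one input sequence to a
# transformer encoder; all sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class AsrVisualFusion(nn.Module):
    def __init__(self, vocab_size=30000, visual_dim=2048, d_model=512):
        super().__init__()
        self.asr_embed = nn.Embedding(vocab_size, d_model)   # ASR tokens -> d_model
        self.visual_proj = nn.Linear(visual_dim, d_model)    # frame features -> d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, asr_tokens, visual_feats):
        # asr_tokens: (batch, n_tokens) int ids; visual_feats: (batch, n_frames, visual_dim)
        fused = torch.cat([self.asr_embed(asr_tokens),
                           self.visual_proj(visual_feats)], dim=1)
        return self.encoder(fused)  # joint representation for a caption decoder

# Example: a batch of 2 clips, 12 ASR tokens and 16 frames each.
model = AsrVisualFusion()
out = model(torch.randint(0, 30000, (2, 12)), torch.randn(2, 16, 2048))
print(out.shape)  # torch.Size([2, 28, 512])
```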

Cited by 28 publications (26 citation statements); references 47 publications.

Citation statements (ordered by relevance):
“…Specifically, the LCM adopts a flat attention model similar to (Ma et al, 2020) to enhance the source video feature by local context. Besides, given multi-modal inputs, LCM is a general model to fuse both the visual features f (v i ) and the text features f (t i ) inside the event with one unified transformer as in (Hessel et al, 2019); 2) we further employ a global context module (GCM) to make the source event to interact with other event features flexibly. The GCM is a cross attention model, which contains one source encoder SEncoder and one cross encoder CEncoder.…”
Section: Hierarchical Context-aware Network
Citation type: mentioning (confidence: 99%)
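
To make the quoted global context module description more concrete, here is a hedged sketch of the cross-attention pattern it outlines: a self-attention pass over the source event (the "SEncoder" role) followed by cross-attention from the source event to the other events (the "CEncoder" role). Tensor shapes and module choices are illustrative assumptions, not the cited implementation.

```python
# Hedged sketch of a "global context" step: the encoded source event queries
# the encodings of the other events via cross-attention. Names and sizes are
# assumptions, not the cited paper's code.
import torch
import torch.nn as nn

d_model = 512
source_encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

src_event = torch.randn(1, 20, d_model)      # fused visual+text features of one event
other_events = torch.randn(1, 60, d_model)   # features of the remaining events

encoded_src = source_encoder(src_event)                  # self-attention over the source event
context, _ = cross_attention(query=encoded_src,          # source event attends to other events
                             key=other_events,
                             value=other_events)
print(context.shape)  # torch.Size([1, 20, 512]) -- source event enriched with global context
```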
“…The dense video event captioning task is to produce a sequence of events and generate a descriptive sentence for each event given a long untrimmed video. In this work, we focus only on the task to generate captions and directly apply the ground-truth event proposals similar to (Hessel et al, 2019;Iashin and Rahtu, 2020b). The paradigm for video captioning is an encoder-decoder network, which inputs video features and outputs descriptions for each event.…”
Section: Preliminary
Citation type: mentioning (confidence: 99%)
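
The encoder-decoder captioning paradigm the quote describes can be sketched as follows. This is a minimal, untrained illustration that assumes ground-truth event segments are given as feature sequences; the dimensions, vocabulary, start-token id, and greedy decoding loop are all hypothetical.

```python
# Minimal encoder-decoder sketch of per-event captioning: encode the features
# of one ground-truth event segment, then greedily decode caption tokens.
# Sizes, vocabulary, and the <bos> id are illustrative assumptions.
import torch
import torch.nn as nn

d_model, vocab = 512, 10000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
tok_embed, out_proj = nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab)

event_feats = torch.randn(1, 32, d_model)   # features of one ground-truth event segment
memory = encoder(event_feats)

tokens = torch.tensor([[1]])                # assumed <bos> id
for _ in range(10):                         # greedy decoding of a short caption
    hidden = decoder(tok_embed(tokens), memory)
    next_tok = out_proj(hidden[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens.shape)  # torch.Size([1, 11])
```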
“…Related Tasks A related task is localizing and classifying steps in instructional videos (Alayrac et al, 2016;Zhukov et al, 2019) where they detect when an action is performed in the video whereas we focus on describing actions. Dense event captioning of instructional videos (Zhou et al, 2018b;Li et al, 2018;Hessel et al, 2019) relies on human curated, densely labeled datasets whereas we extract descriptions of videos automatically through our alignments.…”
Section: Multi-modal Instructional Datasets
Citation type: mentioning (confidence: 99%)