Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1039

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Abstract: Instructional videos get high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR…
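
As a concrete illustration of the input fusion the abstract describes, below is a minimal, hypothetical sketch of feeding ASR token embeddings and per-frame visual features to a single transformer encoder. The dimensions, vocabulary size, and module names are assumptions for illustration, not the paper's actual model.

```python
# A minimal sketch (not the paper's exact architecture) of fusing ASR token
# embeddings with per-frame visual features as one input sequence to a
# transformer encoder; all sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class AsrVisualFusion(nn.Module):
    def __init__(self, vocab_size=30000, visual_dim=2048, d_model=512):
        super().__init__()
        self.asr_embed = nn.Embedding(vocab_size, d_model)   # ASR tokens -> d_model
        self.visual_proj = nn.Linear(visual_dim, d_model)    # frame features -> d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, asr_tokens, visual_feats):
        # asr_tokens: (batch, n_tokens) int ids; visual_feats: (batch, n_frames, visual_dim)
        fused = torch.cat([self.asr_embed(asr_tokens),
                           self.visual_proj(visual_feats)], dim=1)
        return self.encoder(fused)  # joint representation for a caption decoder

# Example: a batch of 2 clips, 12 ASR tokens and 16 frames each.
model = AsrVisualFusion()
out = model(torch.randint(0, 30000, (2, 12)), torch.randn(2, 16, 2048))
print(out.shape)  # torch.Size([2, 28, 512])
```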

Cited by 28 publications (26 citation statements); references 47 publications.

Citation statements (ordered by relevance):
“…Specifically, the LCM adopts a flat attention model similar to (Ma et al, 2020) to enhance the source video feature by local context. Besides, given multi-modal inputs, LCM is a general model to fuse both the visual features f (v i ) and the text features f (t i ) inside the event with one unified transformer as in (Hessel et al, 2019); 2) we further employ a global context module (GCM) to make the source event to interact with other event features flexibly. The GCM is a cross attention model, which contains one source encoder SEncoder and one cross encoder CEncoder.…”
Section: Hierarchical Context-aware Network
Citation type: mentioning (confidence: 99%)
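
To make the quoted global context module description more concrete, here is a hedged sketch of the cross-attention pattern it outlines: a self-attention pass over the source event (the "SEncoder" role) followed by cross-attention from the source event to the other events (the "CEncoder" role). Tensor shapes and module choices are illustrative assumptions, not the cited implementation.

```python
# Hedged sketch of a "global context" step: the encoded source event queries
# the encodings of the other events via cross-attention. Names and sizes are
# assumptions, not the cited paper's code.
import torch
import torch.nn as nn

d_model = 512
source_encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

src_event = torch.randn(1, 20, d_model)      # fused visual+text features of one event
other_events = torch.randn(1, 60, d_model)   # features of the remaining events

encoded_src = source_encoder(src_event)                  # self-attention over the source event
context, _ = cross_attention(query=encoded_src,          # source event attends to other events
                             key=other_events,
                             value=other_events)
print(context.shape)  # torch.Size([1, 20, 512]) -- source event enriched with global context
```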
“…The dense video event captioning task is to produce a sequence of events and generate a descriptive sentence for each event given a long untrimmed video. In this work, we focus only on the task to generate captions and directly apply the ground-truth event proposals similar to (Hessel et al, 2019;Iashin and Rahtu, 2020b). The paradigm for video captioning is an encoder-decoder network, which inputs video features and outputs descriptions for each event.…”
Section: Preliminary
Citation type: mentioning (confidence: 99%)
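
The encoder-decoder captioning paradigm the quote describes can be sketched as follows. This is a minimal, untrained illustration that assumes ground-truth event segments are given as feature sequences; the dimensions, vocabulary, start-token id, and greedy decoding loop are all hypothetical.

```python
# Minimal encoder-decoder sketch of per-event captioning: encode the features
# of one ground-truth event segment, then greedily decode caption tokens.
# Sizes, vocabulary, and the <bos> id are illustrative assumptions.
import torch
import torch.nn as nn

d_model, vocab = 512, 10000
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
tok_embed, out_proj = nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab)

event_feats = torch.randn(1, 32, d_model)   # features of one ground-truth event segment
memory = encoder(event_feats)

tokens = torch.tensor([[1]])                # assumed <bos> id
for _ in range(10):                         # greedy decoding of a short caption
    hidden = decoder(tok_embed(tokens), memory)
    next_tok = out_proj(hidden[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens.shape)  # torch.Size([1, 11])
```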
“…Related Tasks A related task is localizing and classifying steps in instructional videos (Alayrac et al, 2016;Zhukov et al, 2019) where they detect when an action is performed in the video whereas we focus on describing actions. Dense event captioning of instructional videos (Zhou et al, 2018b;Li et al, 2018;Hessel et al, 2019) relies on human curated, densely labeled datasets whereas we extract descriptions of videos automatically through our alignments.…”
Section: Multi-modal Instructional Datasets
Citation type: mentioning (confidence: 99%)