Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3416305
Story Semantic Relationships from Multimodal Cognitions

Abstract: We consider the problem of building semantic relationships of unseen entities from free-form multimodal sources. The intelligent agent understands semantic properties by (1) creating logical segments from the sources, (2) finding interacting objects, and (3) inferring their interaction actions using (4) extracted textual, auditory, visual, and tonal information. Conversational dialogue discourse is automatically mapped to interacting co-located objects and fused with their kinetic action embeddings at each scene of…
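The abstract sketches a per-scene fusion of dialogue discourse, mapped to co-located object pairs, with their kinetic action embeddings. The paper's own implementation is not reproduced on this page; below is only a minimal late-fusion sketch in PyTorch, where the class name PairwiseRelationScorer, the embedding dimensions, and the number of relation labels are all illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical sketch, not the authors' code: score candidate relationship
# labels for one pair of co-located objects by fusing the dialogue embedding
# mapped to that pair with each object's kinetic action embedding.
class PairwiseRelationScorer(nn.Module):
    def __init__(self, text_dim: int, action_dim: int, num_relations: int, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + 2 * action_dim, hidden),  # concat of dialogue + two action embeddings
            nn.ReLU(),
            nn.Linear(hidden, num_relations),
        )

    def forward(self, dialogue_emb, action_emb_a, action_emb_b):
        # dialogue_emb: (batch, text_dim)    dialogue discourse for this object pair
        # action_emb_*: (batch, action_dim)  kinetic action embedding of each object
        fused = torch.cat([dialogue_emb, action_emb_a, action_emb_b], dim=-1)
        return self.fuse(fused)  # (batch, num_relations) relation logits

# Dummy usage; all dimensions are assumptions, not values from the paper.
scorer = PairwiseRelationScorer(text_dim=768, action_dim=512, num_relations=16)
logits = scorer(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 512))

Concatenation followed by a small MLP is just one plausible fusion choice; the paper may well combine the textual, auditory, visual, and tonal streams differently.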

Cited by 4 publications (2 citation statements)
References 14 publications
“…Recognizing fine-grained social relationships & actions from small-scale datasets is a challenging task that requires us to leverage transfer learning to boost the generalisability of the model. In this paper we extend the recent work [1] on semantic relationship understanding by learning holistic scene and text representations. We present a discussion on the most useful feature sets to model the target task of pairwise relation prediction which can be useful for other video understanding tasks such as visual question answering that involve multimodal sources.…”
Section: Background and Related Work
Mentioning confidence: 90%
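The citing work above reports using transfer learning with holistic scene and text representations for pairwise relation prediction on small-scale data. As a hedged illustration only (not the cited paper's code), the sketch below freezes a pretrained torchvision ResNet-18 scene backbone and trains a small relation head; the 768-dimensional text features and 16 relation classes are assumptions.

import torch
import torch.nn as nn
from torchvision import models

# Hypothetical transfer-learning setup: reuse a pretrained scene encoder,
# freeze it, and train only a lightweight pairwise-relation head.
scene_encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
scene_encoder.fc = nn.Identity()            # expose the 512-d pooled scene feature
for p in scene_encoder.parameters():
    p.requires_grad = False                 # frozen backbone helps generalisation on small data

relation_head = nn.Sequential(
    nn.Linear(512 + 768, 256),              # scene feature + (assumed) 768-d text feature
    nn.ReLU(),
    nn.Linear(256, 16),                     # 16 relation classes is an illustrative assumption
)
optimizer = torch.optim.Adam(relation_head.parameters(), lr=1e-4)  # only the head is optimised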
“…In the following section, we present a brief overview of the building blocks that are constructed on top of the baseline pipeline [1].…”
Section: Methods
Mentioning confidence: 99%