Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3416305
Story Semantic Relationships from Multimodal Cognitions

Abstract: We consider the problem of building semantic relationships of unseen entities from free-form multimodal sources. The intelligent agent understands semantic properties by (1) creating logical segments from the sources, (2) finding interacting objects, and (3) inferring their interaction actions using (4) extracted textual, auditory, visual, and tonal information. Conversational dialogue discourse is automatically mapped to interacting co-located objects and fused with their kinetic action embeddings at each scene of…
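The abstract sketches a per-scene fusion of dialogue discourse, mapped to co-located object pairs, with their kinetic action embeddings. The paper's own implementation is not reproduced on this page; below is only a minimal late-fusion sketch in PyTorch, where the class name PairwiseRelationScorer, the embedding dimensions, and the number of relation labels are all illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical sketch, not the authors' code: score candidate relationship
# labels for one pair of co-located objects by fusing the dialogue embedding
# mapped to that pair with each object's kinetic action embedding.
class PairwiseRelationScorer(nn.Module):
    def __init__(self, text_dim: int, action_dim: int, num_relations: int, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + 2 * action_dim, hidden),  # concat of dialogue + two action embeddings
            nn.ReLU(),
            nn.Linear(hidden, num_relations),
        )

    def forward(self, dialogue_emb, action_emb_a, action_emb_b):
        # dialogue_emb: (batch, text_dim)    dialogue discourse for this object pair
        # action_emb_*: (batch, action_dim)  kinetic action embedding of each object
        fused = torch.cat([dialogue_emb, action_emb_a, action_emb_b], dim=-1)
        return self.fuse(fused)  # (batch, num_relations) relation logits

# Dummy usage; all dimensions are assumptions, not values from the paper.
scorer = PairwiseRelationScorer(text_dim=768, action_dim=512, num_relations=16)
logits = scorer(torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 512))

Concatenation followed by a small MLP is just one plausible fusion choice; the paper may well combine the textual, auditory, visual, and tonal streams differently.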

Cited by 4 publications (2 citation statements)
References 14 publications
“…Recognizing fine-grained social relationships & actions from small-scale datasets is a challenging task that requires us to leverage transfer learning to boost the generalisability of the model. In this paper we extend the recent work [1] on semantic relationship understanding by learning holistic scene and text representations. We present a discussion on the most useful feature sets to model the target task of pairwise relation prediction which can be useful for other video understanding tasks such as visual question answering that involve multimodal sources.…”
Section: Background and Related Work
Mentioning confidence: 90%
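The citing work above reports using transfer learning with holistic scene and text representations for pairwise relation prediction on small-scale data. As a hedged illustration only (not the cited paper's code), the sketch below freezes a pretrained torchvision ResNet-18 scene backbone and trains a small relation head; the 768-dimensional text features and 16 relation classes are assumptions.

import torch
import torch.nn as nn
from torchvision import models

# Hypothetical transfer-learning setup: reuse a pretrained scene encoder,
# freeze it, and train only a lightweight pairwise-relation head.
scene_encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
scene_encoder.fc = nn.Identity()            # expose the 512-d pooled scene feature
for p in scene_encoder.parameters():
    p.requires_grad = False                 # frozen backbone helps generalisation on small data

relation_head = nn.Sequential(
    nn.Linear(512 + 768, 256),              # scene feature + (assumed) 768-d text feature
    nn.ReLU(),
    nn.Linear(256, 16),                     # 16 relation classes is an illustrative assumption
)
optimizer = torch.optim.Adam(relation_head.parameters(), lr=1e-4)  # only the head is optimised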
“…In the following section, we present a brief overview of the building blocks that are constructed on top of the baseline pipeline [1].…”
Section: Methods
Mentioning confidence: 99%