HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do

Curtis, K.M.; Awad, George; Rajput, Shahzad; Soboroff, Ian

doi:10.1145/3372278.3390742

Cited by 27 publications

(9 citation statements)

References 8 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The HLVU dataset (Table 1) has 10 open source movies sampled from paper [4]. The training set includes four long and two short movies, while testing set includes two long and two short movies.…”

Section: Datasetmentioning

confidence: 99%

See 1 more Smart Citation

Story Semantic Relationships from Multimodal Cognitions

Anand

Ramesh

Wang

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

View full text Add to dashboard Cite

We consider the problem of building semantic relationship of unseen entities from free-form multi-modal sources. This intelligent agent understands semantic properties by creating (1) logical segments from sources, (2) finds interacting objects, (3) infers their interaction actions using (4) extracted textual, auditory, visual, and tonal information. The conversational dialogue discourses are automatically mapped to interacting co-located objects, and fused with their Kinetic action embeddings at each scene of occurrence. This generates a combined probability distribution representation for interacting entities spanning over every semantic relation class. Using these probabilities, we create knowledge graphs capable of answering semantic queries and infer missing properties in a given context.

show abstract

Section: Datasetmentioning

confidence: 99%

“…The entities provided in the HLVU dataset [4] include person, objects, locations and concepts for which relevant images are provided for mapping. The locations and object entities are localized within scenes using SIFT based feature matching to handle varying scales and crops.…”

Section: Object Detection and Mappingmentioning

confidence: 99%

Story Semantic Relationships from Multimodal Cognitions

Anand

Ramesh

Wang

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

View full text Add to dashboard Cite

show abstract

“…The High-Level Video Understanding (HLVU) dataset [4] includes 10 movies that are suitable for researching the relationship between Figure 1: Architecture of the multi-modal fusion model entities. The HLVU dataset meets the important requirements for selecting movies such as the duration of the movies (different lengths of movies: 6 long movies and 4 short movies), the quality of the video, and the clarity of the storyline.…”

Section: Datasetmentioning

confidence: 99%

Kinetics and Scene Features for Intent Detection

Ramesh

Anand

Wang

et al. 2020

Companion Publication of the 2020 International Conference on Multimodal Interaction

View full text Add to dashboard Cite

“…The challenge uses the recently introduced High Level Video Understanding (HLVU) dataset [5] which consists of 10 movies released under creative commons licenses with a total combined duration of 681 minutes. For each of the 10 videos, the dataset also provides cropped key-frames, showing either characters or locations which are relevant for the story told by the movies.…”

Section: Provided Datamentioning

confidence: 99%

“…mechanism. 5 Detected faces are then compared against the provided example images in order to identify the visible person. Due to changes in size and orientation of the people on screen, as well as variations of overall image quality throughout the videos, we apply the face detection and identification method densely, i.e., on every frame of the videos.…”

Section: Data Pre-processingmentioning

confidence: 99%

Towards Using Semantic-Web Technologies for Multi-Modal Knowledge Graph Construction

Baumgartner

Rossetto

Bernstein

2020

Proceedings of the 28th ACM International Conference on Multimedia

View full text Add to dashboard Cite

While a multitude of approaches for extracting semantic information from multimedia documents has emerged in recent years, isolating any form of holistic semantic representation from a larger type of document, such as a movie, is not yet feasible. In this paper we present our approaches used in the first instance of the Deep Video Understanding Challenge, using a combination of several multimodal detectors and an integration scheme informed by methods from the semantic web context in order to determine the capabilities limitations of currently available methods for the extraction of semantic relations between the characters and locations relevant to the narrative of a movie.

show abstract

HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do

Cited by 27 publications

References 8 publications

Story Semantic Relationships from Multimodal Cognitions

Story Semantic Relationships from Multimodal Cognitions

Kinetics and Scene Features for Intent Detection

Towards Using Semantic-Web Technologies for Multi-Modal Knowledge Graph Construction

Contact Info

Product

Resources

About