Exploiting language models to recognize unseen actions

Le, Dieu Thu; Bernardi, Raffaella; Uijlings, Jasper

doi:10.1145/2461466.2461504

Cited by 21 publications

(11 citation statements)

References 22 publications


“…
Dataset                                  Verbs  Acts  Images  Sen  Des
PPMI (Yao and Fei-Fei, 2010)             2      24    4800    N    N
Stanford 40 Actions (Yao et al, 2011)    33     40    9532    N    N
PASCAL 2012 (Everingham et al, 2015)     9      11    4588    N    N
89 Actions (Le et al, 2013)              36     89    2038    N    N
TUHOI (Le et al, 2014)                   -      2974  10805   N    N
COCO-a (Ronchi and Perona, 2015)         140    162   10000   N    Y
HICO (Chao et al, 2015)                  111    600   47774   Y    N
VerSe (our dataset)                      90     163   3518    Y    Y
…”

Section: Dataset (mentioning)

confidence: 91%

Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

Gella, Lapata, Keller

2016

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua


We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce VerSe, a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels. We propose an unsupervised algorithm based on Lesk which performs visual sense disambiguation using textual, visual, or multimodal embeddings. We find that textual embeddings perform well when gold-standard textual annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. We also verify our findings by using the textual and multimodal embeddings as features in a supervised setting and analyse the performance of the visual sense disambiguation task. VerSe is made publicly available and can be downloaded at: https://github.com/spandanagella/verse.
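The Lesk-style idea in this abstract can be illustrated with a minimal sketch: represent each candidate verb sense and the image as vectors, then pick the sense with the highest cosine similarity to the image. This is an illustration under stated assumptions, not the authors' implementation. The function names (`cosine`, `disambiguate`), the sense identifiers, and the toy three-dimensional vectors are all hypothetical; in practice the vectors would come from textual, visual, or multimodal embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(image_vec, sense_vecs):
    """Return the sense id whose vector is most similar to the image vector."""
    return max(sense_vecs, key=lambda sense: cosine(image_vec, sense_vecs[sense]))


# Hypothetical embeddings for two senses of the verb "play" (values are made up).
senses = {
    "play_instrument": [0.9, 0.1, 0.0],
    "play_sport": [0.1, 0.9, 0.2],
}

# Hypothetical image representation, e.g. derived from object labels or visual features.
image = [0.2, 0.8, 0.1]

best = disambiguate(image, senses)
print(best)
```

With these toy vectors the image lies much closer to the second sense vector, so `disambiguate` selects `play_sport`.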


“…Action images in sports (Gupta, Kembhavi, and Davis 2009; Li and Li 2007) are among the earliest datasets introduced for research. Daily activity datasets (Le, Bernardi, and Uijlings 2013) contain common human activities in daily life. The latest version of Pascal VOC (Maji, Bourdev, and Malik 2011) competition includes ten categories of still image actions, with only a subset of people annotated (bounding box + action).…”

Section: Related Work (mentioning)

confidence: 99%

UCF-STAR: A Large Scale Still Image Dataset for Understanding Human Actions

Safaei, Balouchian, Foroosh

2020

AAAI


Action recognition in still images poses a great challenge due to (i) less available training data and (ii) the absence of temporal information. To address the first challenge, we introduce a dataset for STill image Action Recognition (STAR), containing over 1M images across 50 different human body-motion action categories. UCF-STAR is the largest dataset in the literature for action recognition in still images. The key characteristics of UCF-STAR include (1) focusing on human body-motion rather than relatively static human-object interaction categories, (2) collecting images from the wild to benefit from a varied set of action representations, (3) appending multiple human-annotated labels per image rather than just the action label, and (4) inclusion of a rich, structured, and multi-modal set of metadata for each image. This departs from existing datasets, which typically provide a single annotation in a smaller number of images and categories, with no metadata. UCF-STAR exposes the intrinsic difficulty of action recognition through its realistic scene and action complexity. To benchmark and demonstrate the benefits of UCF-STAR as a large-scale dataset, and to show the role of “latent” motion information in recognizing human actions in still images, we present a novel approach relying on predicting temporal information, yielding higher accuracy on five widely used datasets.


“…Earlier efforts such as Gupta et al [2009], Everingham et al [2010], Yao and Fei-Fei [2010], Yao et al [2011], Le et al [2013] used in-house annotators to label 6-89 human actions (such as "reading," "riding a bike," "playing guitar," or "holding a guitar").…”

Section: Actions and Interactions In Images (mentioning)

confidence: 99%