(Unseen) event recognition via semantic compositionality

Stöttinger, Julian; Uijlings, Jasper; Pandey, Anand; Sebe, Nicu; Giunchiglia, Fausto

doi:10.1109/cvpr.2012.6248037

Cited by 3 publications

(2 citation statements)

References 27 publications

(28 reference statements)

“…Unseen events can be retrieved by a manual definition of such event in terms of attributes. [22] use a manually defined ontology of events in terms of objects to recognise previously unseen events. In contrast, we learn relations between objects and actions from language.…”

Section: Unseen Action/event Recognition (mentioning)

confidence: 99%

Exploiting language models to recognize unseen actions

Bernardi

Uijlings

2013

Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval


This paper addresses the problem of human action recognition. Typically, visual action recognition systems need visual training examples for all actions that one wants to recognize. However, the total number of possible actions is staggering as not only are there many types of actions but also many possible objects for each action type. Normally, visual training examples are needed for all actions of this combinatorial explosion of possibilities. To address this problem, this paper is a first attempt to propose a general framework for unseen action recognition in still images by exploiting both visual and language models. Based on objects recognized in images by means of visual features, the system suggests the most plausible actions exploiting off-the-shelf language models. All components in the framework are trained on universal datasets, hence the system is general, flexible, and able to recognize actions for which no visual training example has been provided. This paper shows that our model yields good performance on unseen action recognition. It even outperforms a state-of-the-art Bag-of-Words model in a realistic scenario where few visual training examples are available.
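The pipeline this abstract describes (recognize objects visually, then let a language model suggest the most plausible action for them) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the `LANGUAGE_MODEL` table, the detection confidences, and the `rank_actions` helper are all hypothetical, and the numbers are invented toy values.

```python
# Hypothetical language-model scores: plausibility of a verb given an object.
# In the paper these would come from off-the-shelf language models; here they
# are invented toy values.
LANGUAGE_MODEL = {
    "horse": {"ride": 0.6, "feed": 0.3},
    "book": {"read": 0.8},
}


def rank_actions(detections, language_model):
    """Rank (verb, object) pairs by detection confidence times verb plausibility.

    detections: dict mapping an object label to a visual confidence in [0, 1].
    language_model: dict mapping an object label to {verb: plausibility}.
    """
    scores = {}
    for obj, confidence in detections.items():
        # Objects unknown to the language model contribute no candidate actions.
        for verb, plausibility in language_model.get(obj, {}).items():
            scores[(verb, obj)] = confidence * plausibility
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical detector output for one image.
detections = {"horse": 0.9, "book": 0.2}
ranked = rank_actions(detections, LANGUAGE_MODEL)
best_action, best_score = ranked[0]
```

No action-specific visual training example is used here: the only visual input is the object detections, which is what lets such a system propose actions it has never seen.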

“…We treat concepts as the attributes of events in our CBER, which is related to the usage of attributes in object recognition [14,24,5,23,28], action recognition [16,31], image retrieval [25], and event recognition in still images [26]. We explore more informative event representations derived from the semantic concept space, which capture not only the distribution of concepts, but also the co-occurrence relationship between concepts.…”

Section: Related Work (mentioning)

confidence: 99%

Video event recognition using concept attributes

Liu

Javed

et al. 2013

2013 IEEE Workshop on Applications of Computer Vision (WACV)


We propose to use action, scene and object concepts as semantic attributes for classification of video events in InTheWild content, such as YouTube videos. We model events using a variety of complementary semantic attribute features developed in a semantic concept space. Our contribution is to systematically demonstrate the advantages of this concept-based event representation (CBER) in applications of video event classification and understanding. Specifically, CBER has better generalization capability, which makes it possible to recognize events from only a few training examples. In addition, CBER makes it possible to recognize a novel event without training examples (i.e., zero-shot learning). We also show that our proposed enhanced event model further improves zero-shot learning. Furthermore, CBER provides a straightforward way for event recounting/understanding. We use the TRECVID Multimedia Event Detection (MED11) open source event definitions and datasets as our test bed and show results on over 1400 hours of videos.
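The zero-shot idea in this abstract (recognizing an event with no training videos by describing it in terms of concepts) could look roughly like the sketch below. It assumes that an event is defined by hand as a list of concept names and that a video is summarized by per-concept detector scores. The scoring rule (mean concept score), the event names, and the concept names are hypothetical and are not taken from the paper or from the MED11 definitions.

```python
def zero_shot_score(video_concepts, event_definition):
    """Mean detector score of the concepts that define an event.

    video_concepts: dict mapping a concept name to a detector score in [0, 1].
    event_definition: list of concept names describing the event.
    """
    if not event_definition:
        return 0.0
    # Concepts the detectors never fired on count as a score of 0.
    total = sum(video_concepts.get(concept, 0.0) for concept in event_definition)
    return total / len(event_definition)


def recognize(video_concepts, event_definitions):
    """Return the event whose defining concepts score highest for the video."""
    return max(
        event_definitions,
        key=lambda event: zero_shot_score(video_concepts, event_definitions[event]),
    )


# Toy concept-detector output for one video and two hand-written definitions.
video = {"dog": 0.9, "grass": 0.7, "cake": 0.1}
events = {
    "dog_show": ["dog", "grass"],
    "birthday_party": ["cake", "candle"],
}
predicted = recognize(video, events)
```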


Coloring Objects: Adjective-Noun Visual Semantic Compositionality

Nguyen

Lazaridou

Bernardi

2014

Proceedings of the Third Workshop on Vision and Language


This paper reports preliminary experiments aiming at verifying the conjecture that semantic compositionality is a general process irrespective of the underlying modality. In particular, we model compositionality of an attribute with an object in the visual modality as done in the case of an adjective with a noun in the linguistic modality. Our experiments show that the concept topologies in the two modalities share similarities, results that strengthen our conjecture.

Language and Vision

Recently, fields like computational linguistics and computer vision have converged to a common way of capturing and representing the linguistic and visual information of atomic concepts, through vector space models. At the same time, advances in computational semantics have led to effective and linguistically inspired approaches of extending such methods from single concepts to arbitrary linguistic units (e.g. phrases), by means of vector-based semantic composition (Mitchell and Lapata, 2010).

Compositionality is important not only from a linguistic perspective but also from a cognitive perspective, and there have been efforts to validate it as a general cognitive process. However, in computer vision compositionality has so far received limited attention. Thus, in this work, we study the phenomenon of visual compositionality and complement the limited previous literature, which has focused on event compositionality (Stöttinger et al., 2012) or general image structure (Socher et al., 2011), by studying models of attribute-object semantic composition.

In a nutshell, our work consists of learning vector representations of attribute-object pairs (e.g., "red car", "cute dog") and of objects (e.g., "car", "dog", "truck", "cat"), and using those to compute the representation of new objects having similar attributes ("red truck", "cute cat"). This question has both theoretical and applied impact. The possibility of developing a visual compositional model of attribute-object, on the one hand, could shed light on the acquisition of such ability in humans; how we learn attribute representation and compose them with different objects is still an open question within the cognitive science community (Mintz and Gleitman, 2002). On the other hand, computer vision systems could become generative and be able to recognize unseen attribute-object combinations, a component especially useful for object recognition and image retrieval.
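As a rough illustration of composing a representation for an unseen attribute-object pair from seen ones, the sketch below assumes a simple additive composition function, one of the families discussed by Mitchell and Lapata (2010). The text above does not say which composition function the paper actually uses, so this is an assumption; the vectors are invented toy values and `compose_unseen` is a hypothetical helper.

```python
import math


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose_unseen(seen_phrase, seen_object, new_object):
    """Additive assumption: phrase = attribute + object.

    The attribute vector is estimated as the offset between a seen phrase and
    its object, then added to a new object to predict the unseen phrase.
    """
    attribute = sub(seen_phrase, seen_object)
    return add(attribute, new_object)


# Toy 3-dimensional visual vectors (invented values).
car = [1.0, 0.0, 0.2]
truck = [0.9, 0.1, 0.8]
red_car = [1.0, 1.0, 0.2]

# Predicted vector for the unseen combination "red truck".
red_truck = compose_unseen(red_car, car, truck)
```

In this toy setting the predicted "red truck" vector lies closer to "red car" than the plain "truck" vector does, which is the kind of shared structure between combinations that such a model relies on.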
