This paper reports preliminary experiments aiming at verifying the conjecture that semantic compositionality is a general process irrespective of the underlying modality. In particular, we model compositionality of an attribute with an object in the visual modality as done in the case of an adjective with a noun in the linguistic modality. Our experiments show that the concept topologies in the two modalities share similarities, results that strengthen our conjecture.
Language and VisionRecently, fields like computational linguistics and computer vision have converged to a common way of capturing and representing the linguistic and visual information of atomic concepts, through vector space models. At the same time, advances in computational semantics have lead to effective and linguistically inspired approaches of extending such methods from single concepts to arbitrary linguistic units (e.g. phrases), through means of vector-based semantic composition (Mitchell and Lapata, 2010).Compositionality is not to be considered only an important component from a linguistic perspective, but also from a cognitive perspective and there has been efforts to validate it as a general cognitive process. However, in computer vision so far compositionality has received limited attention. Thus, in this work, we study the phenomenon of visual compositionality and we complement limited previous literature that has focused on event compositionality (Stöttinger et al., 2012) or general image structure (Socher et al., 2011), by studying models of attribute-object semantic composition.In a nutshell, our work consists of learning vector representations of attribute-object (e.g., "red car", "cute dog" etc.) and objects (e.g., "car", "dog", "truck", "cat" etc.) and by using those compute the representation of new objects having similar attributes ("red truck", "cute cat" etc.). This question has both theoretical and applied impact. The possibility of developing a visual compositional model of attribute-object, on the one hand, could shed light on the acquisition of such ability in humans; how we learn attribute representation and compose them with different objects is still an open question within the cognitive science community (Mintz and Gleitman, 2002). On the other hand, computer vision systems could become generative and be able to recognize unseen attribute-object combinations, a component especially useful for object recognition and image retrieval.