CVPR 2011
DOI: 10.1109/cvpr.2011.5995711

Recognition using visual phrases

Cited by 378 publications (371 citation statements: 5 supporting, 366 mentioning, 0 contrasting)
References 11 publications
“…In the second experiment, we learn object and attribute classifiers jointly and predict object-attribute pairs (e.g. predicting that an apple is red), as in Sadeghi and Farhadi (2011).…”
Section: Methods (mentioning; confidence: 99%)

“…Related Work Comparison It is also worth mentioning in this section some prior work on relationships. The concept of visual relationships has already been explored in Visual Phrases (Sadeghi and Farhadi 2011), who introduced a dataset of 17 such relationships such as next_to (person, bike) and riding (person, horse). However, their dataset is limited to just these 17 relationships.…”
Section: Top Relationship Distributions (mentioning; confidence: 99%)

“…Each subcategory has reduced appearance diversity (via improved alignment), leading to a simpler learning problem. The recent success of the discriminatively-trained mixture model framework of Felzenszwalb et al, [8] has led to the wide popularity of such models for object detection [14,17,18,20,23]. Applying such model to the four images in Figure 1(a) would likely result in each being assigned to a separate subcategory and trained with others of its kind.…”
Section: Introduction (mentioning; confidence: 99%)

“…Gupta et al [3] use the AND-OR graph formalism to represent spatiotemporal relations among objects and actions in videos. Sadeghi and Farhadi [4] examine the scale of unit at which to categorize objects, and develop a notion of visual phrases for jointly recognizing co-occurring objects. Farhadi et al [5] develop image models that indicate the presence of object, action, scene triplets.…”
Section: Introduction (mentioning; confidence: 99%)