In this paper, we introduce a new game to crowd-source natural language referring expressions. By designing a two-player game, we can both collect and verify referring expressions directly within the game. To date, the game has produced a dataset containing 130,525 expressions referring to 96,654 distinct objects in 19,894 photographs of natural scenes. This dataset is larger and more varied than previous REG datasets and allows us to study referring expressions in real-world scenes. We provide an in-depth analysis of the resulting dataset. Based on our findings, we design a new optimization-based model for generating referring expressions and perform experimental evaluations on three test sets.
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. A demo and code are provided.
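To make the dynamic score combination concrete, here is a minimal PyTorch sketch, assuming precomputed scalar matching scores from the three modules and a pooled embedding of the expression; `weight_fc` and all dimensions are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of language-weighted module score combination
# (a simplification, not the MAttNet codebase).

def combine_module_scores(expr_embedding, s_subj, s_loc, s_rel, weight_fc):
    """Weight subject/location/relationship scores by language attention.

    expr_embedding: (d,) pooled embedding of the referring expression
    s_subj, s_loc, s_rel: scalar matching scores from the three modules
    weight_fc: learned linear layer mapping the embedding to 3 logits
    """
    # Language-based attention: softmax over the three module logits.
    w = F.softmax(weight_fc(expr_embedding), dim=-1)  # (3,)
    scores = torch.stack([s_subj, s_loc, s_rel])      # (3,)
    # Overall score is the attention-weighted sum of module scores.
    return (w * scores).sum()

# Example usage with hypothetical dimensions and scores.
d = 512
weight_fc = torch.nn.Linear(d, 3)
expr = torch.randn(d)
overall = combine_module_scores(
    expr, torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.5),
    weight_fc)
```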
Humans refer to objects in their environments all the time, especially in dialogue with other people. We explore generating and comprehending natural language referring expressions for objects in images.In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly. We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly. Evaluation on three recent datasets -RefCOCO, RefCOCO+, and RefCOCOg 1 , shows the advantages of our methods for both referring expression generation and comprehension.
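One plausible form of such a visual-comparison feature is sketched below: the target object's CNN feature is concatenated with its average difference from same-category context objects in the image. The function name and feature dimensions are illustrative assumptions, not the paper's exact feature.

```python
import numpy as np

# Minimal sketch of a visual-comparison feature (an assumption about
# its form: target feature plus mean difference to context objects).

def visual_comparison_feature(target_feat, context_feats):
    """Concatenate the target's feature with its average difference
    from other objects of the same category in the image."""
    if len(context_feats) == 0:
        diff = np.zeros_like(target_feat)
    else:
        diffs = target_feat[None, :] - np.stack(context_feats)
        diff = diffs.mean(axis=0)  # average difference to context
    return np.concatenate([target_feat, diff])

# Hypothetical 4096-d CNN features for a target and two context objects.
target = np.random.randn(4096)
others = [np.random.randn(4096) for _ in range(2)]
feat = visual_comparison_feature(target, others)  # shape (8192,)
```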
Human activity recognition has the potential to impact a wide range of applications, from surveillance to human-computer interfaces to content-based video retrieval. Recently, the rapid development of inexpensive depth sensors (e.g. Microsoft Kinect) has provided adequate accuracy for real-time full-body human tracking in activity recognition applications. In this paper, we create a complex human activity dataset depicting two-person interactions, including synchronized video, depth, and motion capture data. Moreover, we use our dataset to evaluate various features typically used for indexing and retrieval of motion capture data, in the context of real-time detection of interaction activities via Support Vector Machines (SVMs). Experimentally, we find that geometric relational features based on the distances between all pairs of joints outperform other feature choices. For whole-sequence classification, we also explore techniques related to Multiple Instance Learning (MIL), in which the sequence is represented by a bag of body-pose features. We find that the MIL-based classifier outperforms SVMs when the sequences extend temporally around the interaction of interest.
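The winning feature in that evaluation is simple enough to sketch. The following is a minimal illustration assuming 15-joint 3D skeletons for two tracked people; the joint count and the cross-person distance terms are assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Minimal sketch of pairwise-joint-distance features for one frame.

def pairwise_joint_distances(joints):
    """joints: (J, 3) array of 3D joint positions for one frame.
    Returns the J*(J-1)/2 Euclidean distances between all joint pairs."""
    return pdist(joints)

# Two tracked skeletons with 15 joints each: concatenate per-person
# distances and cross-person distances into one frame descriptor.
person_a = np.random.randn(15, 3)
person_b = np.random.randn(15, 3)
frame_feature = np.concatenate([
    pairwise_joint_distances(person_a),
    pairwise_joint_distances(person_b),
    np.linalg.norm(person_a[:, None, :] - person_b[None, :, :],
                   axis=-1).ravel(),  # cross-person joint distances
])
# frame_feature could then be fed to an SVM (e.g. sklearn.svm.SVC).
```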
In this paper we demonstrate an effective method for parsing clothing in fashion photographs, an extremely challenging problem due to the large number of possible garment items, variations in configuration, garment appearance, layering, and occlusion. In addition, we provide a large novel dataset and tools for labeling garment items, to enable future research on clothing estimation. Finally, we present intriguing initial results on using clothing estimates to improve pose identification, and demonstrate a prototype application for pose-independent visual garment retrieval.
We posit that visually descriptive language offers computer vision researchers both information about the world and information about how people describe the world. The potential benefit of this source is amplified by the enormous amount of language data readily available today. We present a system that automatically generates natural language descriptions of images by exploiting both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.
We propose an agent-based behavioral model of pedestrians to improve tracking performance in realistic scenarios. In this model, we view pedestrians as decision-making agents who consider a plethora of personal, social, and environmental factors to decide where to go next. We formulate the prediction of pedestrian behavior as an energy minimization on this model. Two of our main contributions are simple yet effective estimates of pedestrian destination and social relationships (groups). Our final contribution is to incorporate these hidden properties into an energy formulation that results in accurate behavioral prediction. We evaluate both our estimates of destination and grouping, as well as our accuracy at prediction and tracking, against a state-of-the-art behavioral model and show improvements, especially in the challenging situation of infrequent appearance observations, something that might occur in the thousands of webcams available on the Internet.
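As a rough sketch of energy-based behavioral prediction, the toy example below scores candidate velocities by attraction toward an estimated destination plus repulsion from nearby pedestrians, then picks the minimum-energy step by grid search. The specific energy terms, weights, and search procedure are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Minimal sketch of an energy-based next-step choice for one pedestrian
# (a simplification of the agent-based model described above).

def step_energy(v, pos, goal, others, w_goal=1.0, w_social=2.0):
    """Energy of taking velocity v for one time step.

    v: candidate 2D velocity; pos: current position; goal: estimated
    destination; others: positions of nearby pedestrians.
    """
    nxt = pos + v
    # Attraction toward the (estimated) destination.
    e_goal = np.linalg.norm(goal - nxt)
    # Repulsion from other pedestrians (comfortable-distance term).
    e_social = sum(np.exp(-np.linalg.norm(o - nxt)) for o in others)
    return w_goal * e_goal + w_social * e_social

def predict_next_velocity(pos, goal, others, speed=1.4, n_angles=36):
    """Grid search over headings: pick the minimum-energy velocity."""
    angles = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    candidates = [speed * np.array([np.cos(a), np.sin(a)]) for a in angles]
    return min(candidates, key=lambda v: step_energy(v, pos, goal, others))

# Example: one pedestrian heading toward (10, 0), avoiding a neighbor.
v = predict_next_velocity(np.zeros(2), np.array([10.0, 0.0]),
                          [np.array([2.0, 0.5])])
```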