We propose Graph R-CNN, a novel scene graph generation model that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing metrics and our proposed metric.
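The abstract only names the two components, so the following is a minimal sketch of the general pattern they describe: score all O(n²) object pairs with a learned relatedness function, keep only the top-k pairs, and refine features with attention-weighted message passing. The bilinear scoring form, the weight shapes, and all function names here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def relatedness_scores(feats, W):
    """Score every ordered object pair with a bilinear relatedness
    function; a stand-in for a learned pair-pruning score."""
    logits = feats @ W @ feats.T          # (n, n) pairwise scores
    np.fill_diagonal(logits, -np.inf)     # no self-relations
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid relatedness

def prune_relations(scores, k):
    """Keep only the top-k scoring pairs, avoiding the quadratic
    blow-up of reasoning about every possible relation downstream."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [np.unravel_index(i, scores.shape) for i in flat]

def attentional_gcn_layer(feats, edges, Wg, Wa):
    """One attention-weighted message-passing step: each node
    aggregates neighbor features with softmax attention weights."""
    out = feats.copy()
    for i in range(len(feats)):
        nbrs = [j for (a, j) in edges if a == i]
        if not nbrs:
            continue
        att = np.array([feats[i] @ Wa @ feats[j] for j in nbrs])
        att = np.exp(att - att.max()); att /= att.sum()  # softmax
        msg = sum(a * (feats[j] @ Wg) for a, j in zip(att, nbrs))
        out[i] = np.tanh(feats[i] + msg)
    return out

# Toy usage with random features and weights.
rng = np.random.default_rng(0)
n, d = 6, 16
feats = rng.normal(size=(n, d))
scores = relatedness_scores(feats, rng.normal(size=(d, d)) * 0.1)
edges = prune_relations(scores, k=10)
feats = attentional_gcn_layer(feats, edges,
                              rng.normal(size=(d, d)) * 0.1,
                              rng.normal(size=(d, d)) * 0.1)
```

Pruning to the top-k pairs keeps downstream graph reasoning linear in the number of retained relations rather than quadratic in the number of objects.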
We present a new AI task, Embodied Question Answering (EmbodiedQA), in which an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ('orange'). This challenging task requires a range of AI skills: active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
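As a concrete picture of the navigate-then-answer loop the task implies, here is a toy episode sketch. The gym-style env/agent interface, the action names, and the step limit are all illustrative assumptions; the paper's agents are trained end-to-end with reinforcement learning, not hand-coded like this dummy.

```python
# Toy EmbodiedQA episode loop; the env/agent interfaces below are
# illustrative assumptions, not the paper's actual API.
import random

class DummyEnv:
    def reset(self):          # spawn agent, return egocentric frame
        return "frame@spawn"
    def step(self, action):   # move agent, return new egocentric frame
        return f"frame_after_{action}"

class DummyAgent:
    def act(self, frame, question):      # navigation policy
        return random.choice(["forward", "turn-left", "turn-right", "stop"])
    def answer(self, frame, question):   # answering module
        return "orange"

def run_episode(env, agent, question, max_steps=100):
    frame = env.reset()
    for _ in range(max_steps):
        action = agent.act(frame, question)
        if action == "stop":             # agent decides it has seen enough
            break
        frame = env.step(action)
    return agent.answer(frame, question)

print(run_episode(DummyEnv(), DummyAgent(), "What color is the car?"))
```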
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents, Q-BOT and A-BOT, who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end, from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that the two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset [4], where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Q-BOT learns to ask questions that A-BOT is good at, ultimately resulting in more informative dialog and a better team.

Existing works [4,5] first collect a dataset of human-human dialog, i.e., a sequence of question-answer pairs about an image (q_1, a_1), ..., (q_T, a_T). Next, a machine (a deep neural network) is provided with the image I, the human dialog recorded till round t-1, (q_1, a_1), ..., (q_{t-1}, a_{t-1}), and the follow-up question q_t, and is supervised to generate the human response a_t. Essentially, at each round t, the machine is artificially 'injected' into the conversation between two humans and asked to answer the question q_t; but the machine's answer â_t is thrown away, because at the next round t+1, the machine is again provided with the 'ground-truth' human-human dialog that includes the human response a_t and not the machine response â_t. Thus, the machine is never allowed to steer the conversation, because that would take the dialog out of the dataset, making it non-evaluable.
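The injection protocol above maps directly onto a short training-loop sketch. The model interface (a single loss method) and the toy data are assumptions for illustration; only the control flow, in which the machine's prediction is discarded and the human answer a_t is appended to the history, follows the description.

```python
# Sketch of the supervised 'injection' protocol described above.
# The model interface and toy data are illustrative assumptions;
# only the control flow mirrors the text.

def supervised_dialog_epoch(model, dataset):
    """dataset yields (image, [(q_1, a_1), ..., (q_T, a_T)]) pairs."""
    total_loss = 0.0
    for image, dialog in dataset:
        history = []                      # always the human-human dialog
        for q_t, a_t in dialog:
            # The machine is 'injected' at round t: it sees the image,
            # the recorded history, and q_t, and is trained to emit a_t.
            total_loss += model.loss(image, history, q_t, a_t)
            # Its own answer is thrown away; the *human* response a_t
            # is appended, so the machine never steers the conversation.
            history.append((q_t, a_t))
    return total_loss

class DummyModel:                         # placeholder, not a real network
    def loss(self, image, history, question, answer):
        return 1.0 / (1 + len(history))

toy = [("img", [("what color is it?", "orange"), ("is it moving?", "no")])]
print(supervised_dialog_epoch(DummyModel(), toy))
```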
Hands appear very often in egocentric video, and their appearance and pose give important cues about what people are doing and what they are paying attention to. But existing work on hand detection makes strong assumptions that hold only in simple scenarios, such as limited interaction with other people or controlled lab settings. We develop methods to locate and distinguish between hands in egocentric video using strong appearance models based on Convolutional Neural Networks, and introduce a simple candidate region generation approach that outperforms existing techniques at a fraction of the computational cost. We show how these high-quality bounding boxes can be used to create accurate pixelwise hand regions, and, as an application, we investigate the extent to which hand segmentation alone can distinguish between different activities. We evaluate these techniques on a new dataset of 48 first-person videos of people interacting in realistic environments, with pixel-level ground truth for over 15,000 hand instances.
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data.
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
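The finding that the best models use all four inputs suggests a simple fusion baseline as a mental model. The sketch below is a generic late-fusion readout with encoder outputs stubbed as random fixed-size vectors; it is purely illustrative and is not one of the paper's baseline systems.

```python
# Generic late-fusion sketch for scene-aware dialog; the encoders are
# stubbed with random vectors and the weights are untrained. This is
# an illustration of 'use every input', not the paper's baseline.
import numpy as np

rng = np.random.default_rng(0)
d = 32                                    # per-modality feature size

def encode(modality_input):               # stand-in for real encoders
    return rng.normal(size=d)

video, audio, question, history = map(
    encode, ["video", "audio", "question", "dialog history"])

W = rng.normal(size=(d, 4 * d)) * 0.1     # fusion weights (untrained)
fused = np.tanh(W @ np.concatenate([video, audio, question, history]))
print(fused.shape)                        # context vector for a decoder
```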