Samyak Datta scite author profile

We present a new AI task, Embodied Question Answering (EmbodiedQA), where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ('orange'). This challenging task requires a range of AI skills: active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.

Embodied Question Answering in Photorealistic Environments With Point Cloud Perception

Wijmans

et al. 2019

To help bridge the gap between internet vision-style problems and the goal of vision for embodied perception, we instantiate a large-scale navigation task, Embodied Question Answering [1], in photo-realistic environments (Matterport 3D). We thoroughly study navigation policies that utilize 3D point clouds, RGB images, or their combination. Our analysis of these models reveals several key findings. We find that two seemingly naive navigation baselines, forward-only and random, are strong navigators and challenging to outperform, due to the specific choice of the evaluation setting presented by [1]. We find a novel loss-weighting scheme we call Inflection Weighting to be important when training recurrent models for navigation with behavior cloning, and we are able to outperform the baselines with this technique. We find that point clouds provide a richer signal than RGB images for learning obstacle avoidance, motivating the use (and continued study) of 3D deep learning models for embodied navigation.
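
The abstract names Inflection Weighting but does not define it. The sketch below illustrates one plausible reading and should not be taken as the authors' implementation. It assumes that an "inflection" is a time step whose expert action differs from the previous one, and that such steps receive a larger weight in the behavior-cloning loss. The function names, the `inflection_weight` parameter, and the dictionary representation of action probabilities are all hypothetical.

```python
from math import log


def inflection_weights(actions, inflection_weight):
    """Return one loss weight per time step.

    Assumption (not stated in the abstract): a step is an inflection when
    its expert action differs from the previous step's action. Inflections
    get `inflection_weight`; every other step gets 1.0. The first step has
    no predecessor, so it is never treated as an inflection.
    """
    weights = [1.0]
    for prev, cur in zip(actions, actions[1:]):
        weights.append(inflection_weight if cur != prev else 1.0)
    return weights[:len(actions)]


def weighted_bc_loss(action_probs, actions, inflection_weight):
    """Weighted negative log-likelihood of the expert actions.

    `action_probs[t]` maps each action to the policy's probability at step t.
    The sum is normalised by the total weight; `actions` must be non-empty.
    """
    weights = inflection_weights(actions, inflection_weight)
    total = sum(-w * log(p[a]) for w, p, a in zip(weights, action_probs, actions))
    return total / sum(weights)


# Hypothetical expert trajectory: the turn and the step after it are inflections.
expert = ["forward", "forward", "left", "forward"]
example_weights = inflection_weights(expert, 3.0)
```

For long runs of a repeated action such as `forward`, a weighting of this kind shifts the training signal toward the comparatively rare steps where the expert changes behavior.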

Embodied Question Answering

et al. 2018

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

Datta

Sikka

Roy

et al. 2019

We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a "downstream" task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs. In the subsequent step, this learned representation is aligned with the caption. Our key contribution lies in building this "caption-conditioned" image encoding, which tightly couples both tasks and allows the weak supervision to effectively guide visual grounding. We provide extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report improvements of 4.9% and 1.3% (absolute) over prior state-of-the-art on the VisualGenome and Flickr30k Entities datasets. We also report results that are on par with the state-of-the-art on the downstream caption-to-image retrieval task on the COCO and Flickr30k datasets.

Recent works [20,21] have shown evidence that operating under such a paradigm helps boost performance for image-caption matching. Generally, these models consist of two stages: (1) a local matching module that infers the latent region-phrase correspondences to generate local matching information, and (2) a global matching module that uses this information to perform image-caption matching. This setup allows phrase grounding to act as an intermediate and a prerequisite task for image-caption matching.

It is important to note that the primary objective of such works has been image-caption matching and not phrase grounding. An artifact of training under such a paradigm is the amplification of correlations between selective regions and phrases.

"Young girl holding a kitten" by Gennadiy Kolodkin is licensed under CC BY-NC-ND 2.0.
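
The two-stage setup described above (local region-phrase matching, then global image-caption matching) can be sketched in a few lines. This is an illustrative toy, not the Align2Ground model. It assumes that each phrase is matched to its most similar RoI by cosine similarity, and that the matched RoIs are mean-pooled into the caption-conditioned encoding. Both choices, along with all function names and the plain-list vectors, are assumptions made for the example.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


def match_phrases_to_rois(phrase_vecs, roi_vecs):
    """Local matching: for each phrase, the index of its most similar RoI."""
    return [
        max(range(len(roi_vecs)), key=lambda i: cosine(phrase, roi_vecs[i]))
        for phrase in phrase_vecs
    ]


def caption_conditioned_encoding(roi_vecs, matches):
    """Pool only the matched RoIs (mean pooling is an assumption here)."""
    dim = len(roi_vecs[0])
    return [sum(roi_vecs[i][d] for i in matches) / len(matches) for d in range(dim)]


def image_caption_score(phrase_vecs, roi_vecs, caption_vec):
    """Global matching: align the pooled encoding with the caption embedding."""
    matches = match_phrases_to_rois(phrase_vecs, roi_vecs)
    encoding = caption_conditioned_encoding(roi_vecs, matches)
    return matches, cosine(encoding, caption_vec)


# Hypothetical 2-D embeddings for three RoIs, two phrases, and one caption.
rois = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phrases = [[1.0, 0.1], [0.1, 1.0]]
matches, score = image_caption_score(phrases, rois, [1.0, 1.0])
```

Because the image encoding is built only from RoIs that some phrase selected, the retrieval score depends on which regions were matched, which is how a caption-level signal can influence phrase grounding.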

Embodied Question Answering

Das

Datta

Gkioxari

et al. 2017

Preprint

Unsupervised Learning of Face Representations

Datta

Sharma

Jawahar

2018

We present an approach for unsupervised training of CNNs in order to learn discriminative face representations. We mine supervised training data by noting that multiple faces in the same video frame must belong to different persons and the same face tracked across multiple frames must belong to the same person. We obtain millions of face pairs from hundreds of videos without using any manual supervision. Although faces extracted from videos have a lower spatial resolution than those available in standard supervised face datasets such as LFW and CASIA-WebFace, the former represent a much more realistic setting, e.g. surveillance scenarios where most of the detected faces are very small. We train our CNNs with the relatively low-resolution faces extracted from the collected video frames and achieve a higher verification accuracy on the benchmark LFW dataset than hand-crafted features such as LBPs. Our CNNs even surpass the performance of state-of-the-art deep networks such as VGG-Face when those networks are made to work with low-resolution input images.
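
The pair-mining rule in the abstract can be sketched as follows. This is a minimal illustration that assumes a face tracker has already assigned a track id to every detection. The `(frame_index, track_id, face_id)` tuple layout and the function name are hypothetical and do not come from the paper.

```python
from itertools import combinations


def mine_face_pairs(detections):
    """Mine same-person and different-person pairs without manual labels.

    `detections` is an iterable of (frame_index, track_id, face_id) tuples.
    Returns (positives, negatives) as sets of sorted face_id pairs:
      - positives: two faces on the same track (same face tracked over frames)
      - negatives: two faces in the same frame that lie on different tracks
    """
    by_track, by_frame = {}, {}
    for frame, track, face in detections:
        by_track.setdefault(track, []).append(face)
        by_frame.setdefault(frame, []).append((track, face))

    positives = {
        tuple(sorted(pair))
        for faces in by_track.values()
        for pair in combinations(faces, 2)
    }
    negatives = {
        tuple(sorted((face_a, face_b)))
        for items in by_frame.values()
        for (track_a, face_a), (track_b, face_b) in combinations(items, 2)
        if track_a != track_b
    }
    return positives, negatives


# Two tracks, "A" and "B", each visible in frames 0 and 1.
detections = [(0, "A", "a0"), (0, "B", "b0"), (1, "A", "a1"), (1, "B", "b1")]
positives, negatives = mine_face_pairs(detections)
```

Cross-frame pairs on different tracks (for example `a0` and `b1`) are deliberately left unlabeled, since the rule in the abstract only guarantees distinct identities for faces that co-occur in one frame.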

Integrating Geometric and Textural Features for Facial Emotion Classification Using SVM Frameworks

Datta

Sen

Balasubramanian

2016

Facial emotion classification using concatenated geometric and textural features

Sen

Datta

Raman

2018

Multimed Tools Appl
