We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop training techniques that improve learning efficiency by: (1) eliminating a train-inference mismatch; (2) rejecting easy negatives during mini-batch training; and (3) using a ratio of negatives to positives that is two orders of magnitude larger than existing approaches. We conduct a thorough ablation study to understand the importance of different factors and training techniques using the challenging HICO-Det dataset [4].
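The factorized scoring described above can be illustrated with a minimal numpy sketch. This is a toy illustration, not the paper's implementation: the function name, the specific factor values, and the gating of the sigmoid term by the two detector confidences are assumptions chosen to show how independent factor logits combine log-linearly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interaction_score(det_human, det_object, factor_logits):
    # Detector confidences gate the interaction term; the factor logits
    # (appearance, box-pair layout, optionally pose) are summed before
    # the sigmoid, so the model is a simple log-linear combination of
    # independent factors.
    return det_human * det_object * sigmoid(np.sum(factor_logits))

# Hypothetical example: confident detections, mildly positive factors.
score = interaction_score(0.9, 0.8, [1.2, 0.4, -0.1])
```

Because the factors only interact through a sum of logits, each one can be ablated independently, which is what makes the ablation study in the abstract straightforward.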
Our goal is to recover a complete 3D model from a depth image of an object. Existing approaches rely on user interaction or apply only to a limited class of objects, such as chairs. We aim to reconstruct a 3D model of an object from any category, fully automatically. We take an exemplar-based approach: we retrieve similar objects from a database of 3D models using view-based matching and transfer the symmetries and surfaces from the retrieved models. We investigate completion of 3D models in three cases: novel view (model in database); novel model (models for other objects of the same category in database); and novel category (no models from the category in database).
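The view-based retrieval step can be sketched as nearest-neighbor search over view descriptors. This is a hypothetical illustration: the descriptor, the cosine-similarity metric, and the function name are assumptions; the paper's actual matching features may differ.

```python
import numpy as np

def retrieve_exemplars(query, database, k=3):
    # Cosine similarity between the query view descriptor and every
    # database view descriptor; the k best-matching models are returned
    # so their symmetries and surfaces can be transferred.
    q = query / np.linalg.norm(query)
    D = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = D @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Toy database of three 2-D view descriptors.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, sims = retrieve_exemplars(np.array([0.6, 0.8]), db, k=2)
```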
Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval and Fusion Network (Craft), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. Craft explicitly predicts a temporal layout of mentioned entities (characters and objects), retrieves spatio-temporal entity segments from a video database, and fuses them to generate scene videos. Our contributions include sequential training of the components of Craft while jointly modeling layout and appearance, and losses that encourage learning compositional representations for retrieval. We evaluate Craft on semantic fidelity to the caption, composition consistency, and visual quality. Craft outperforms direct pixel-generation approaches and generalizes well to unseen captions and to unseen video databases with no text annotations. We demonstrate Craft on Flintstones, a new richly annotated video-caption dataset with over 25,000 videos. For a glimpse of videos generated by Craft, see https://youtu.be/688Vv86n0z8.
Phrase grounding, the problem of associating image regions with caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on the mutual information between images and caption words. Given pairs of images and captions, we maximize the compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language-model-guided word substitutions. Training with our negatives yields a ∼10% absolute gain in accuracy over randomly sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K Entities benchmark.
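The word-region attention and contrastive objective can be sketched as follows. This is a simplified toy version, not the paper's model: the feature shapes, the mean-over-words pooling, and the function names are assumptions; it only shows the structure of scoring a caption via attention-weighted regions and ranking the true caption against substituted negatives with an InfoNCE-style loss.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def caption_score(regions, words):
    # Each word attends over image regions; a word's score is its dot
    # product with the attention-weighted region, and the caption score
    # is the mean over words.
    attn = softmax(words @ regions.T, axis=1)   # (W, R) attention
    attended = attn @ regions                   # (W, D) attended regions
    return float(np.mean(np.sum(words * attended, axis=1)))

def grounding_loss(regions, pos_words, neg_word_sets):
    # InfoNCE-style lower bound: the matching caption competes against
    # negatives built by substituting words in the true caption.
    logits = np.array([caption_score(regions, pos_words)]
                      + [caption_score(regions, w) for w in neg_word_sets])
    return float(-np.log(softmax(logits)[0]))
```

Minimizing this loss pushes attention toward regions that make the true caption's words more compatible with the image than the substituted negatives, which is how grounding emerges without region-level supervision.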
A special-purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation, such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
Purpose: It would not be an exaggeration to say that healthcare is among the most critical sectors today. The healthcare sector works on several dimensions simultaneously, such as the safety, care, quality and cost of services. Still, the desired outcomes remain distant, and it is pertinent to address the issues associated with healthcare on a priority basis to sustain outcomes over the long term. The present study explores the healthcare sector and lists the directly associated enablers that contribute to increasing its viability. It also studies the interrelationships among the listed enablers, which helps in setting priorities for dealing with individual enablers based on their contribution to viability improvement.

Design/methodology/approach: The authors conducted an extensive review to list the enablers of efficient and effective performance in the healthcare sector. The enablers were then ranked using the modified Total Interpretive Structural Modelling (m-TISM) approach. Validation of the study reveals the importance of the enablers based on their position in the hierarchical structure. Further, MICMAC analysis is performed to categorize the identified enablers into clusters based on their driving power and dependence.

Findings: The research envisages the importance of the healthcare sector and its contribution to national development. The outcomes of the m-TISM model reveal the noteworthy contribution of organizational structure in managing healthcare facilities and represent it as a perspective for future growth.
A well-designed organizational structure in the healthcare industry helps establish better employee-employer cooperation, workforce coordination and inter-departmental cooperation.

Research limitations/implications: Every research work has limitations, and the present work is no exception: the input used to develop the models comes from very few experts, which may not reflect the opinion of the whole sector.

Practical implications: The healthcare sector is growing in the present-day scenario, and it is essential to keep the quality of treatment in check along with the quantity. The present study lays down practical foundations for improving the viability of the healthcare sector. It also emphasizes the accountability of healthcare officials to act on the enablers with strong driving power for effective utilization of all resources, which would further help in customer (patient) satisfaction.

Originality/value: Despite the worldwide increase in demand for good-quality healthcare facilities, the growth of this sector is bounded by economic, demographic, cultural and environmental concerns. The present study proposes a unique framework that provides a better understanding of the enablers and can play a key role in increasing the viability of the healthcare sector. The hierarchy developed with m-TISM and the MICMAC analysis will help readers recognize the important enablers based on their contribution to the viability improvement of the healthcare sector.
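The MICMAC step described above is mechanical once the final reachability matrix from the (m-)TISM procedure is available, and can be sketched in a few lines. This is a generic illustration of standard MICMAC clustering, not the study's actual data: the toy matrix and the function name are assumptions.

```python
import numpy as np

def micmac(reachability):
    # Driving power = row sums and dependence = column sums of the final
    # (binary, reflexive) reachability matrix; the midpoint of the scale
    # splits the four conventional MICMAC quadrants.
    R = np.asarray(reachability)
    n = R.shape[0]
    driving, dependence = R.sum(axis=1), R.sum(axis=0)
    mid = n / 2.0
    labels = []
    for p, d in zip(driving, dependence):
        if p > mid and d > mid:
            labels.append("linkage")
        elif p > mid:
            labels.append("independent")   # strong drivers, low dependence
        elif d > mid:
            labels.append("dependent")
        else:
            labels.append("autonomous")
    return list(driving), list(dependence), labels

# Hypothetical 3-enabler reachability matrix (1s on the diagonal).
drv, dep, lab = micmac([[1, 1, 1],
                        [0, 1, 1],
                        [0, 0, 1]])
```

Enablers in the "independent" quadrant are the strong drivers the study singles out for priority attention, since influencing them propagates down the hierarchy.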