Huijuan Xu scite author profile

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Saenko

2016

583

462

Abstract. We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process that constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and obtain improved results by adding a second attention hop which considers the whole question to choose visual evidence based on the results of the first hop. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the attention weights. We evaluate our model on two published visual question answering datasets, DAQUAR [1] and VQA [2], and obtain improved results compared to a strong deep baseline model (iBOWIMG) which concatenates image and question features to predict the answer [3].
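The attention "hop" described in this abstract can be illustrated with a minimal sketch. This is not the authors' Spatial Memory Network: it assumes that region features and the question are already embedded as same-length vectors, scores regions with a plain dot product, and forms the second-hop query by adding the first-hop evidence to the question. The names `attention_hop`, `memory`, and `question`, and the toy values, are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_hop(memory, query):
    """One hop: score every stored region against the query, then pool.

    memory: list of region feature vectors (one per spatial region)
    query:  vector of the same length as each region vector
    Returns the attention weights and the weighted sum of region features.
    """
    scores = [sum(m * q for m, q in zip(cell, query)) for cell in memory]
    weights = softmax(scores)
    dim = len(query)
    evidence = [sum(w * cell[d] for w, cell in zip(weights, memory))
                for d in range(dim)]
    return weights, evidence

# Toy memory: three image regions with 2-dimensional features (assumed values).
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]

# First hop: the question alone selects regions.
w1, e1 = attention_hop(memory, question)

# Second hop (an assumed combination rule): refine the query with the
# evidence gathered in the first hop, then attend again.
second_query = [q + e for q, e in zip(question, e1)]
w2, e2 = attention_hop(memory, second_query)
```

In this toy setup the weights from each hop form a distribution over regions, which is the quantity the abstract describes visualizing.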

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

Xu

2017

Multilevel Language and Vision Integration for Text-to-Clip Retrieval

Plummer

et al. 2019

AAAI

264

225

We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.

Something-Else: Compositional Action Recognition With Spatial-Temporal Interaction Networks

et al. 2020

Meta-Baseline: Exploring Simple Meta-Learning for Few-Shot Learning

Chen

Liu

et al. 2021

183

Spatio-Temporal Action Graph Networks

Herzig

Levi

et al. 2019

Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with global descriptors. We propose a novel inter-object graph representation for activity recognition based on a disentangled graph embedding with direct observation of edge appearance. In contrast to prior efforts, our approach uses explicit appearance for high order relations derived from object-object interaction, formed over regions that are the union of the spatial extent of the constituent objects. We employ a novel factored embedding of the graph structure, disentangling a representation hierarchy formed over spatial dimensions from that found over temporal variation. We demonstrate the effectiveness of our model on the Charades activity recognition benchmark, as well as a new dataset of driving activities focusing on multi-object interactions with near-collision events. Our model offers significantly improved performance compared to baseline approaches without object-graph representations, or with previous graph-based models.

Learning Instance Activation Maps for Weakly Supervised Instance Segmentation

Zhu

Zhou

et al. 2019

Translating Videos to Natural Language Using Deep Recurrent Neural Networks