Grounding of Textual Phrases in Images by Reconstruction

Rohrbach, Anna; Rohrbach, Marcus; Hu, Ronghang; Darrell, Trevor; Schiele, Bernt

doi:10.1007/978-3-319-46448-0_49

Cited by 422 publications

(558 citation statements)

References 47 publications

(95 reference statements)

Supporting

Mentioning

553

Contrasting


“…[72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase.…”

Section: Contemporaneous and Subsequent Work (mentioning)

confidence: 99%
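The mechanism summarized in the statement above (encode the phrase, then attend over image locations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `phrase_vec` stands in for an LSTM encoding of the phrase, `region_vecs` stands in for CNN features of candidate locations, and dot-product scoring is a stand-in rather than the scoring function of the cited paper. The reconstruction decoder that supplies the training signal is omitted.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(phrase_vec, region_vecs):
    """Return attention weights over regions and the attention-weighted visual vector."""
    scores = [dot(phrase_vec, r) for r in region_vecs]  # hypothetical scoring function
    weights = softmax(scores)
    dim = len(region_vecs[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_vecs)) for i in range(dim)]
    return weights, attended

# Toy inputs (hypothetical): one phrase encoding and three candidate regions.
phrase_vec = [1.0, 0.0]
region_vecs = [[0.1, 0.9], [2.0, 0.1], [0.5, 0.5]]

weights, attended = attend(phrase_vec, region_vecs)
# At test time the phrase is grounded to the region with the highest attention weight.
grounded_region = max(range(len(weights)), key=weights.__getitem__)
```

In a reconstruction-based setup, `attended` would be passed to a decoder that tries to regenerate the phrase, so the attention weights can be learned without box-level supervision.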

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Donahue

Hendricks

Rohrbach

et al. 2014

Self Cite

2,391

2,848


Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.


“…
Method | R@1 | R@5 | R@10
MCB [11] | 48.7 | - | -
GroundeR [35] | 47.8 | - | -
Embedding Network [43] | 51.0 | 70.4 | 75.5
Similarity Network [43] | 51.0 | 70.3 | 75.0
SPC [33] | 55.4 | - | -
IGOP [47] | 53.9 | - | -
CITE [32] | 59.
Table 7.…”

Section: Methods (mentioning)

confidence: 99%

“…We next analyze the benefit of our jWAE framework for phrase localization on the Flickr30k Entities dataset [34]. Phrase localization associates (grounds) a phrase to a region in the image using bounding boxes [5,35,43,47]. Following [43], we formulate phrase localization as a retrieval problem where given an image and a phrase from its associated sentence, the phrase is mapped to the regions in the image.…”

Section: Phrase Localization (mentioning)

confidence: 99%
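The retrieval formulation in the statement above can be sketched as follows: regions are ranked by similarity to the phrase, and a phrase counts as localized at rank k if one of the top k boxes overlaps the ground-truth box. This is a sketch under stated assumptions, not the cited paper's implementation: cosine similarity, the 0.5 IoU threshold, and all vectors and boxes below are hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rank_regions(phrase_vec, region_vecs):
    # Region indices sorted from most to least similar to the phrase embedding.
    return sorted(range(len(region_vecs)),
                  key=lambda i: cosine(phrase_vec, region_vecs[i]),
                  reverse=True)

def hit_at_k(ranking, boxes, gt_box, k, thresh=0.5):
    # True if any of the top-k ranked boxes overlaps the ground truth enough.
    return any(iou(boxes[i], gt_box) >= thresh for i in ranking[:k])

# Toy data (hypothetical): embeddings in a shared space and proposal boxes.
phrase_vec = [1.0, 0.0]
region_vecs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.1]]
boxes = [(0, 0, 10, 10), (20, 20, 40, 40), (50, 50, 60, 60)]
gt_box = (22, 22, 40, 40)

ranking = rank_regions(phrase_vec, region_vecs)
localized_at_1 = hit_at_k(ranking, boxes, gt_box, k=1)
localized_at_2 = hit_at_k(ranking, boxes, gt_box, k=2)
```

Averaging `hit_at_k` over all phrases in a test set gives a recall-at-k figure of the kind reported as R@1, R@5, and R@10.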

“…Bounding box proposal regions are extracted with Edge Box [49]. Since we are mainly interested in evaluating the quality of our multimodal embeddings rather than the specific task, we compared to other embedding-based approaches [35,43]. Additionally, we integrate our jWAE framework with Conditional Image Text Embedding (CITE) [32], which builds on top of the embeddings from the Similarity Network.…”

Section: Phrase Localization (mentioning)

confidence: 99%


Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings

Mahajan

Botschen

Gurevych

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


One of the key challenges in learning joint embeddings of multiple modalities, e.g. of images and text, is to ensure coherent cross-modal semantics that generalize across datasets. We propose to address this through joint Gaussian regularization of the latent representations. Building on Wasserstein autoencoders (WAEs) to encode the input in each domain, we enforce the latent embeddings to be similar to a Gaussian prior that is shared across the two domains, ensuring compatible continuity of the encoded semantic representations of images and texts. Semantic alignment is achieved through supervision from matching image-text pairs. To show the benefits of our semi-supervised representation, we apply it to cross-modal retrieval and phrase localization. We not only achieve state-of-the-art accuracy, but significantly better generalization across datasets, owing to the semantic continuity of the latent space.


“…Motivated from co-reference resolution tasks in NLP, a number of studies have investigated matching free-form phrases with images where the task is to locate each visual entity mentioned in a caption by predicting a bounding box in the corresponding image (Hodosh et al, 2010; Kong et al, 2014; Plummer et al, 2015; Rohrbach et al, 2015).…”

Section: Text-to-image Co-referencing (mentioning)

confidence: 99%

Leveraging Captions in the Wild to Improve Object Detection

Kilickaya

Ikizler-Cinbis

Erdem

et al. 2016

Proceedings of the 5th Workshop on Vision and Language


In this study, we explore whether the captions in the wild can boost the performance of object detection in images. Captions that accompany images usually provide significant information about the visual content of the image, making them an important resource for image understanding. However, captions in the wild are likely to include numerous types of noises which can hurt visual estimation. In this paper, we propose data-driven methods to deal with the noisy captions and utilize them to improve object detection. We show how a pre-trained state-of-the-art object detector can take advantage of noisy captions. Our experiments demonstrate that captions provide promising cues about the visual content of the images and can aid in improving object detection.

