ReferItGame: Referring to Objects in Photographs of Natural Scenes

Kazemzadeh, Sahar; Ordóñez, Vicente; Matten, Mark; Berg, Tamara L.

doi:10.3115/v1/d14-1086

Cited by 828 publications

(835 citation statements)

References 24 publications

Mentioning: 830

“…As we are interested in tracking by natural language specification, we augment the videos in OTB100 with natural language descriptions of the target object. Following the guidelines in [19] we ask annotators for a discriminative referring description of the target. For fairness the annotators describe the target based on the first frame only.…”

Section: Datasets (mentioning)

confidence: 99%

“…ReferIt [19]. The ReferIt dataset is proposed in [19] for the task of object localization and segmentation by natural language expression.…”

Section: Datasets (mentioning)

confidence: 99%

“…The ReferIt dataset is proposed in [19] for the task of object localization and segmentation by natural language expression. It is the largest publicly available dataset that contains natural language expressions annotated on segmented regions.…”

Section: Datasets (mentioning)

confidence: 99%

“…To train the lingual specification network, we first pre-train the network on the ReferIt [19] dataset using segmentation masks, since language queries from Lingual OTB99 and Lingual ImageNet Videos are still limited. For the visual specification network, instead of using the full image as input, we follow [3] to crop a large search region around the center of the target box location.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…We fine-tune it using the training videos from Lingual OTB99 or Lingual ImageNet Videos. Similarly, our joint model is also fine-tuned based on pretrained networks using the ReferIt [19] and ImageNet classification datasets. The parameters of all the networks are all trained with a standard SGD solver with momentum.…”

Section: Implementation Details (mentioning)

confidence: 99%


Tracking by Natural Language Specification. Tao, Gavves, et al. 2017. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).


Materials Research at Shanghai Jiao Tong University. Chen, Feng. 2015. Advanced Materials.


Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Because self-attention (SA) in Transformers has quadratic computational complexity, existing methods tend to apply SA within local regions to reduce overhead. However, this local design restricts the exploitation of global context, which is critical for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps and then uses cross-attention to extract global information. Meanwhile, the channel dimensions of the attention matrices (query, key, and value) are further scaled for a better trade-off between computational overhead and performance. Furthermore, we combine RG-SA with local self-attention to enhance the exploitation of global context, and propose hybrid adaptive integration (HAI) for module integration. HAI allows direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods.


Visual Reasoning with Multi-hop Feature Modulation. Strub, Seurin, Perez, et al. 2018. Computer Vision – ECCV 2018.


Recent breakthroughs in computer vision and natural language processing have spurred interest in challenging multi-modal tasks such as visual question-answering and visual dialogue. For such tasks, one successful approach is to condition image-based convolutional network computation on language via Feature-wise Linear Modulation (FiLM) layers, i.e., per-channel scaling and shifting. We propose to generate the parameters of FiLM layers going up the hierarchy of a convolutional network in a multi-hop fashion rather than all at once, as in prior work. By alternating between attending to the language input and generating FiLM layer parameters, this approach is better able to scale to settings with longer input sequences such as dialogue. We demonstrate that multi-hop FiLM generation achieves state-of-the-art for the short input sequence task ReferIt, on par with single-hop FiLM generation, while also significantly outperforming prior state-of-the-art and single-hop FiLM generation on the GuessWhat?! visual dialogue task.
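The "per-channel scaling and shifting" that the abstract attributes to FiLM layers can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `film`, the nested-list feature layout, and the fixed `gamma` and `beta` values are assumptions made for illustration. In the approach the abstract describes, these parameters would be generated from the language input rather than hard-coded.

```python
from typing import List

# Hypothetical layout for illustration: features[channel][row][col]
FeatureMap = List[List[List[float]]]


def film(features: FeatureMap, gamma: List[float], beta: List[float]) -> FeatureMap:
    """Scale every value in channel c by gamma[c], then shift it by beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need exactly one (gamma, beta) pair per channel")
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels of a 2x2 feature map (made-up numbers).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -0.5], [1.5, -1.5]],
]

# Assumption: fixed modulation parameters stand in for language-conditioned ones.
modulated = film(features, gamma=[2.0, -1.0], beta=[0.0, 1.0])
print(modulated)
```

Each channel receives its own scale and shift, so the same two parameters apply to every spatial position within that channel.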
