Automatic Concept Discovery from Parallel Text and Visual Corpora

Sun, Chen; Gan, Chuang; Nevatia, Ram

doi:10.1109/iccv.2015.298

Cited by 96 publications

(59 citation statements)

References 29 publications


“…These captions are called candidate captions. The captions for the query image are selected from these captions pool [47,55,108,130]. These methods produce general and syntactically correct captions.…”

Section: Image Captioning Methods (mentioning)

confidence: 99%

A Comprehensive Survey of Deep Learning for Image Captioning

et al. 2019


Generating a description of an image is called image captioning. Image captioning requires to recognize the important objects, their attributes and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundation of the techniques to analyze their performances, strengths and limitations. We also discuss the datasets and the evaluation metrics popularly used in deep learning based automatic image captioning.


“…Karpathy et al [16] propose a deep visual-semantic alignment (DVSA) model for image retrieval, which uses the BiLSTM to encode query features and R-CNN detector [11] to extract object representations. Sun et al [36] advise an automatic visual concept discovery algorithm to boost the performance of image retrieval. Moreover, Hu et al [15] and Mao et al [24] regard this problem as natural language object retrieval.…”

Section: Image/video Retrieval (mentioning)

confidence: 99%

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Zhang, Lin, Zhao, et al. 2019

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval



Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query. Existing works often only focus on one aspect of this emerging task, such as the query representation learning, video context modeling or multi-modal fusion, thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to consider multiple crucial factors for this challenging task, including (1) the syntactic structure of natural language queries; (2) long-range semantic dependencies in video context and (3) the sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, propose a multi-head self-attention to capture long-range semantic dependencies from video context, and next employ a multi-stage cross-modal interaction to explore the potential relations of video and query contents. The extensive experiments demonstrate the effectiveness of our proposed method. Our core code has been released at https://github.com/ikuinen/CMIN.

CCS Concepts: Information systems → Novelty in information retrieval.

Keywords: Query-based moment retrieval; syntactic GCN; multi-head self-attention; multi-stage cross-modal interaction


“…Typically in these approaches, web-crawlers collect easily available noisy multi-modal data [8,12,79] or e-books [17] which is jointly processed for labelling and knowledge extraction. The features are used for diverse applications such as classification and retrieval [68,76] or product description generation [82].…”

Section: Related Work (mentioning)

confidence: 99%

Aesthetic Image Captioning From Weakly-Labelled Photographs

Ghosal, Rana, Smolić 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


Aesthetic image captioning (AIC) refers to the multimodal task of generating critical textual feedbacks for photographs. While in natural image captioning (NIC), deep models are trained in an end-to-end manner using large curated datasets such as MS-COCO, no such large-scale, clean dataset exists for AIC. Towards this goal, we propose an automatic cleaning strategy to create a benchmarking AIC dataset, by exploiting the images and noisy comments easily available from photography websites. We propose a probabilistic caption-filtering method for cleaning the noisy web-data, and compile a large-scale, clean dataset 'AVA-Captions' (∼230,000 images with ∼5 captions per image). Additionally, by exploiting the latent associations between aesthetic attributes, we propose a strategy for training a convolutional neural network (CNN) based visual feature extractor, typically the first component of an AIC framework. The strategy is weakly supervised and can be effectively used to learn rich aesthetic representations, without requiring expensive ground-truth annotations. We finally showcase a thorough analysis of the proposed contributions using automatic metrics and subjective evaluations.
