2013
DOI: 10.1109/tpami.2012.162
BabyTalk: Understanding and Generating Simple Image Descriptions

Abstract: We present a system to automatically generate natural language descriptions from images. This system consists of two parts. The first part, content planning, smooths the output of computer vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words to use to describe an image. The second step, surface realization, chooses words to construct natural language sentences based on the predicted content and general…
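The two-stage pipeline the abstract describes can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's actual code: the function names, the multiplicative scoring of detector confidence against corpus counts, and the sentence template are all assumptions made for the example.

```python
# Hypothetical sketch of a BabyTalk-style two-stage pipeline:
# (1) content planning picks content words, (2) surface realization
# renders them into a sentence. Scoring and templates are illustrative.

def content_planning(detections, corpus_counts):
    """Pick one content word per detected object by combining the
    detector's confidence with corpus frequency statistics."""
    plan = []
    for candidates in detections:  # each detection offers (label, score) pairs
        best = max(candidates, key=lambda c: c[1] * corpus_counts.get(c[0], 1))
        plan.append(best[0])
    return plan

def surface_realization(content_words):
    """Render the planned content words into a simple templated sentence."""
    if not content_words:
        return "Nothing detected."
    return "There is " + " and ".join("a " + w for w in content_words) + "."

# Toy example: two detections, each with competing candidate labels.
detections = [[("dog", 0.9), ("wolf", 0.4)], [("sofa", 0.7), ("couch", 0.6)]]
corpus_counts = {"dog": 500, "wolf": 20, "sofa": 80, "couch": 120}
print(surface_realization(content_planning(detections, corpus_counts)))
# → There is a dog and a couch.
```

Note how the corpus statistics can overrule a slightly higher detector score: "couch" wins over "sofa" because it is more common in the (toy) descriptive-text counts, which mirrors the abstract's idea of smoothing vision output with text statistics.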

Cited by 691 publications (373 citation statements)
References 39 publications
“…Our work has been inspired by the works building very large-scale image databases [8,38] and the works establishing semantic connections of texts and images [25]. We observe good semantic coherence between labels obtained by hierarchical document topic models [6] and clinician's assessment.…”
Section: Introductionmentioning
confidence: 84%
“…Image-to-language correspondence was learned from ImageNet dataset and reasonably high quality image description datasets (Pascal1K [36], Flickr8K [16], Flickr30K [47]) in [20], where such caption datasets are not available in the medical domain. Graphical models have been employed to predict image attributes ( [27,39]), or to describe images ( [25]) using manually annotated datasets ( [36,26]). Automatic label mining on large, unlabeled datasets is presented in [35,18], however the variety of the label-space is limited (image text annotations).…”
Section: Related Workmentioning
confidence: 99%
“…Here we take an analogous approach, modifying the image retrieval stage of a data-driven pipeline, for the task of image captioning. There has been significant recent interest in generating natural language descriptions of photographs (Kulkarni et al. 2013; Farhadi et al. 2010b). These techniques are typically quite complex: they recognize various visual concepts such as objects, materials, scene types, and the spatial relationships among these entities, and then generate plausible natural language sentences based on this scene understanding.…”
Section: Scene Attributes As Global Featuresmentioning
confidence: 99%