2010
DOI: 10.1109/jproc.2010.2050411

I2T: Image Parsing to Text Description

Abstract: In this paper, we present an image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding. The proposed I2T framework follows three steps: 1) Input images (or video frames) are decomposed into their constituent visual patterns by an image parsing engine, in a spirit similar to parsing sentences in natural language. 2) The image parsing results are converted into a semantic representation in the form of Web Ontology Language (OWL)…
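To make the abstract's three-step pipeline concrete, here is a minimal, hypothetical Python sketch: a hand-made parse graph stands in for the image parsing engine, plain tuples stand in for OWL triples, and a one-line template fills in for the text generation engine. None of the node names, predicates, or templates below come from the paper, which uses and-or graph parsing and a full OWL knowledge base.

```python
# Toy illustration of the I2T pipeline; all names are hypothetical.

# Step 1 (stand-in): a hand-made "parse graph" of one image, i.e. the
# kind of output an image parsing engine might produce.
parse_graph = {
    "objects": ["sky", "water", "boat"],
    "relations": [("boat", "floats_on", "water"),
                  ("sky", "is_above", "water")],
}

# Step 2: convert the parse graph into OWL-like subject-predicate-object
# triples (shown here as plain tuples rather than real RDF/XML).
def to_triples(graph):
    triples = [(obj, "rdf:type", "scene:Object") for obj in graph["objects"]]
    triples += [(s, f"scene:{p}", o) for s, p, o in graph["relations"]]
    return triples

# Step 3: a trivial template-based text generator over the triples;
# type triples are skipped, relation triples become one sentence each.
def describe(triples):
    sentences = []
    for s, p, o in triples:
        if p.startswith("scene:"):
            verb = p.split(":")[1].replace("_", " ")
            sentences.append(f"The {s} {verb} the {o}.")
    return " ".join(sentences)

if __name__ == "__main__":
    print(describe(to_triples(parse_graph)))
    # -> "The boat floats on the water. The sky is above the water."
```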

Cited by 246 publications (116 citation statements)
References 69 publications
“…Yao et al. [26] look at the problem of generating text with a comprehensive system built on various hierarchical knowledge ontologies and using a human in the loop for hierarchical image parsing (except in specialized circumstances). In contrast, our work automatically mines knowledge about textual representation, and parses images fully automatically, without a human operator, and with a much simpler approach overall.…”
Section: Related Work (mentioning, confidence: 99%)
“…Instead of humans and their activities, they focused on the detection of objects, their inter-relations, and events in videos. Yao et al. presented their work on video-to-text description [6]; this work depended on a significant amount of annotated data, a requirement that is avoided in this paper. Yang et al. developed a framework for generating textual descriptions of static images, dealing with images containing up to two objects [7].…”
Section: Related Work (mentioning, confidence: 99%)
“…This approach overcomes some of the limitations of Hidden Markov Models and Dynamic Bayesian Networks, because not only the model parameters but also the model structures are learned. In [11], and-or graphs are used to generate text descriptions for a large dataset containing many different types of videos and images. Note that these last four studies are not just concerned with classifying activities.…”
Section: Related Work (mentioning, confidence: 99%)
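Since the excerpts above turn on and-or graph parsing, a minimal sketch may help: AND nodes decompose a scene into its parts, OR nodes choose among alternative interpretations, and sampling the graph yields one concrete parse whose terminals can feed a text generator. The graph contents and node names below are hypothetical, not taken from [11] or the I2T paper.

```python
# Minimal and-or graph sketch: AND nodes expand all children,
# OR nodes select one alternative. All node names are hypothetical.
import random

graph = {
    "Scene":      ("AND", ["Background", "Foreground"]),
    "Background": ("OR",  ["sky", "indoor wall"]),
    "Foreground": ("AND", ["person", "Action"]),
    "Action":     ("OR",  ["walking", "sitting"]),
}

def sample(node):
    """Expand a node into terminal symbols: AND nodes expand every
    child; OR nodes pick one alternative (uniformly at random here)."""
    if node not in graph:  # terminal symbol
        return [node]
    kind, children = graph[node]
    if kind == "AND":
        return [t for child in children for t in sample(child)]
    return sample(random.choice(children))

print("Parse:", sample("Scene"))  # e.g. ['sky', 'person', 'walking']
```

In a real system the OR-branch choice would be driven by image evidence rather than random sampling, and the resulting terminals would be grounded in the semantic representation before text generation.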