Unsupervised Learning from Narrated Instruction Videos

Image captioning, which aims to automatically generate a sentence description for an image, has attracted much research attention in cognitive computing. The task is rather challenging, since it requires cognitively combining the techniques from both computer vision and natural language processing domains. Existing CNN-RNN framework-based methods suffer from two main problems: in the training phase, all the words of captions are treated equally without considering the importance of different words; in the caption generation phase, the semantic objects or scenes might be misrecognized. In our paper, we propose a method based on the encoder-decoder framework, named Reference based Long Short Term Memory (R-LSTM), aiming to lead the model to generate a more descriptive sentence for the given image by introducing reference information. Specifically, we assign different weights to the words according to the correlation between words and images during the training phase. We additionally maximize the consensus score between the captions generated by the captioning model and the reference information from the neighboring images of the target image, which can reduce the misrecognition problem. We have conducted extensive experiments and comparisons on the benchmark datasets MS COCO and Flickr30k. The results show that the proposed approach can outperform the state-of-the-art approaches on all metrics, especially achieving a 10.37% improvement in terms of CIDEr on MS COCO. By analyzing the quality of the generated captions, we come to a conclusion that through the introduction of reference information, our model can learn the key information of images and generate more trivial and relevant words for images.

“…Please note that there are other captioning tasks that are related to our research, such as dense captioning [22] and video captioning [1,48].…”

Section: Related Work (mentioning)

confidence: 99%

“…We propose a weighting strategy based on words' similarity by considering the semantic information. For word s_i, suppose the synonym set ('synset') of its kth meaning is ss_ik by WordNet, we compute the weight of the synset as follows:…”

Section: Weighted Training (mentioning)

confidence: 99%

Neural Image Caption Generation with Weighted Training and Reference

et al. 2018

“…Among them, instructional videos provide more intuitive visual examples, and will be focused on in this paper. With the explosion of video data on the Internet, people around the world have uploaded and watched substantial instructional videos [6], [59], covering miscellaneous categories. As suggested by the scientists in educational psychology [54], novices often face difficulties in learning from the whole realistic task, and it is necessary to divide the whole task into smaller segments or steps as a form of simplification.…”

Section: Introduction (mentioning)

confidence: 99%

“…Accordingly, a variety of related tasks have been studied by the modern computer vision community in recent years (e.g., action temporal localization [74], [80], video summarization [23], [49], [79] and video captioning [35], [77], [83], etc.). Also, increasing efforts have been devoted to exploring different challenges of instructional video analysis [6], [31], [59], [82]. As evidence, Fig. 2 shows the growing number of publications in the top venues over the recent ten years.…”

Section: Introduction (mentioning)

confidence: 99%

Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation

Tang

Lu

Zhou

2021

IEEE Trans. Pattern Anal. Mach. Intell.

Thanks to the substantial and explosively increased instructional videos on the Internet, novices are able to acquire knowledge for completing various tasks. Over the past decade, growing efforts have been devoted to investigating the problem of instructional video analysis. However, most existing datasets in this area have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. To address this, we present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated efficiently with a series of step labels and the corresponding temporal boundaries. In order to provide a benchmark for instructional video analysis, we evaluate plenty of approaches on the COIN dataset under five different settings. Furthermore, we exploit two important characteristics (i.e., task-consistency and ordering-dependency) for localizing important steps in instructional videos. Accordingly, we propose two simple yet effective methods, which can be easily plugged into conventional proposal-based action detection models. We believe the introduction of the COIN dataset will promote the future in-depth research on instructional video analysis for the community. Our dataset, annotation toolbox and source code are available at http://coin-dataset.github.io.

Figure (Domain / Task / Step hierarchy):
Domain: Vehicles. Task: Change the Car Tire. Steps: {unscrew the screws, jack up the car, remove the tire, put on the tire, tighten the screws}
Domain: Household Items. Task: Replace the Door Knob. Steps: {remove the door knob, remove bolt and pin board, install new pin board, install new bolt, install new door knob}

Full-Body Awareness from Partial Observations

Rockwell

Fouhey

2020

Computer Vision – ECCV 2020

Cited by 221 publications

References 25 publications
