2015
DOI: 10.1613/jair.4556
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video

Abstract: We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their ch…
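As a rough illustration of the compositional idea described in the abstract, the sketch below composes per-word scoring functions over participant tracks according to a predicate-argument structure supplied by a parse. Everything here (function names, the track representation, the scoring scheme) is an illustrative assumption, not the paper's model; the actual framework composes HMM-style word models with object-track lattices and performs joint inference, which this toy omits.

```python
"""Toy sketch: compose word-level scores into a sentence-level score
under a predicate-argument structure. Illustrative assumptions only."""
from typing import Callable, Dict, List, Sequence, Tuple

# A "track" stands in for the time series of detections of one participant.
Track = List[dict]

# A word model scores how well a tuple of participant tracks exhibits the
# word's meaning, e.g. approach(agent, patient) over two tracks.
WordModel = Callable[[Sequence[Track]], float]


def sentence_score(
    word_models: Dict[str, WordModel],
    predicate_args: List[Tuple[str, Tuple[str, ...]]],
    assignment: Dict[str, Track],
) -> float:
    """Sum the scores of every word applied to the tracks that the
    predicate-argument structure binds to its arguments. Because one
    assignment feeds every word, all words must be satisfied by the
    same set of participants."""
    total = 0.0
    for word, args in predicate_args:
        tracks = [assignment[var] for var in args]
        total += word_models[word](tracks)
    return total


if __name__ == "__main__":
    # Dummy word models that score only by track length, purely to show the
    # composition mechanism; real models would score appearance and motion.
    models: Dict[str, WordModel] = {
        "person": lambda ts: float(len(ts[0])),
        "approach": lambda ts: float(len(ts[0]) + len(ts[1])),
        "chair": lambda ts: float(len(ts[0])),
    }
    # "The person approached the chair": person(x0), approach(x0, x1), chair(x1)
    structure = [("person", ("x0",)), ("approach", ("x0", "x1")), ("chair", ("x1",))]
    binding = {"x0": [{"frame": 0}, {"frame": 1}], "x1": [{"frame": 0}]}
    print(sentence_score(models, structure, binding))  # -> 6.0
```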

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

0
20
0

Year Published

2016
2016
2021
2021

Publication Types

Select...
5
3

Relationship

4
4

Authors

Journals

citations
Cited by 29 publications
(21 citation statements)
references
References 86 publications
0
20
0
Order By: Relevance
“…This is possibly because small objects, such as utensils and ingredients, are hard to detect using global visual features but are crucial for describing a recipe. Hence, one future extension for our work is to incorporate object detectors/trackers [39,40] into the current captioning system. We show qualitative results in Fig.…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…Learning occurs mostly in simulation and with little visual ambiguity, and the resulting model is not a parser but a means of associating n-grams with visual concepts. Siddharth et al. (2014) and Yu et al. (2015) acquire the meaning of a lexicon from videos paired with sentences but assume a fully-trained parser. Matuszek et al. (2012) similarly present a model to learn the meanings and referents of words restricted to attributes and static scenes.…”
Section: Prior Work
confidence: 99%
“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al. 2012; Das et al. 2013; Guadarrama et al. 2013; Rohrbach et al. 2014; Venugopalan et al. 2015; Yu et al. 2015, 2016), where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”
Section: Fig.
confidence: 99%