2023
DOI: 10.48550/arxiv.2301.01953
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Learning Trajectory-Word Alignments for Video-Language Tasks

Abstract: Aligning objects with words plays a critical role in Image-Language BERT (IL-BERT) and Video-Language BERT (VDL-BERT). Different from the image case where an object covers some spatial patches, an object in a video usually appears as an object trajectory, i.e., it spans over a few spatial but longer temporal patches and thus contains abundant spatiotemporal contexts. However, modern VDL-BERTs neglect this trajectory characteristic that they usually follow IL-BERTs to deploy the patch-to-word (P2W) attention wh… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...

Citation Types

0
0
0

Publication Types

Select...

Relationship

0
0

Authors

Journals

citations
Cited by 0 publications
references
References 36 publications
0
0
0
Order By: Relevance

No citations

Set email alert for when this publication receives citations?