Cited by 1,040 publications (1,175 citation statements)
References 30 publications
“…Cross-modal pre-training. In the past year, many works extended BERT to model cross-modal data [22,34,36,5,20,35]. The recent BERT model for video-text modeling [35] introduces visual words for video frames encoding, where local regional information is largely ignored.…”
Section: Related Work (mentioning)
confidence: 99%
“…In this way, clip-level actions are represented, and the corresponding action label is inserted. Besides global action information, we incorporate local regional information to provide fine-grained visual cues [22,36,34,20,5]. Object regions provide detailed visual clues about the whole scene, including the regional object feature, the position of the object.…”
Section: Introduction (mentioning)
confidence: 99%
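The statement above describes combining each detected object's regional feature with its position to supply fine-grained visual cues. The sketch below shows one generic way such a region embedding can be formed in PyTorch; the class name, feature dimensions, and the (x1, y1, x2, y2, area) position encoding are illustrative assumptions, not the cited paper's actual design.

import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    # Hypothetical sketch: embed each detected object region from its RoI
    # feature plus its normalized bounding-box position, as a generic way of
    # injecting local regional information into a transformer input sequence.
    def __init__(self, feat_dim=2048, pos_dim=5, hidden_dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # regional object feature
        self.pos_proj = nn.Linear(pos_dim, hidden_dim)    # (x1, y1, x2, y2, area), all in [0, 1]
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, roi_feats, boxes):
        # roi_feats: (batch, num_regions, feat_dim) from an off-the-shelf detector
        # boxes:     (batch, num_regions, pos_dim) normalized box coordinates
        return self.norm(self.feat_proj(roi_feats) + self.pos_proj(boxes))

# Usage on dummy inputs: 2 clips, 10 regions each -> (2, 10, 768) region tokens
region_tokens = RegionEmbedding()(torch.randn(2, 10, 2048), torch.rand(2, 10, 5))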
“…To overcome these limitations, some researchers developed pretrained models [20,21] in which natural language instructions and images for the VLN task are embedded together with large-scale benchmark datasets in addition to R2R datasets. VisualBERT [22], Vision-and-Language BERT (ViLBERT) [23], Visual-Linguistic BERT (VL-BERT) [24], and UNiversal Image-TExt Representation (UNITER) [25], are pretrained models applicable to various vision–language tasks. There are also models pretrained specifically for VLN tasks [20,21].…”
Section: Related Work (mentioning)
confidence: 99%
“…At test time, only captions are used to enable a fair comparison of the models. We use a model similar to Chen et al (2020) with a different token type embedding for inline references.…”
Section: Image-text Matching (mentioning)
confidence: 99%
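The quoted statement mentions using a different token type embedding for inline references. Below is a minimal, hypothetical PyTorch sketch of that general idea: a BERT-style input embedding that reserves an extra token type id for inline-reference tokens. All names and dimensions are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    # Hypothetical BERT-style input embedding with an extra token type id
    # (here id 2) reserved for inline-reference tokens.
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512, num_token_types=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_len, hidden_dim)
        self.type_emb = nn.Embedding(num_token_types, hidden_dim)  # 0/1: usual segments, 2: inline reference
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids, token_type_ids):
        # token_ids, token_type_ids: (batch, seq_len); type id 2 marks inline-reference tokens
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.word_emb(token_ids) + self.pos_emb(positions) + self.type_emb(token_type_ids)
        return self.norm(x)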
“…Model Our model follows Chen et al (2020) with a few modifications. We tokenize the input text and pass the token embeddings through a BERT encoder (Devlin et al).…”
Section: Model (mentioning)
confidence: 99%
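For the general pattern in the last statement, tokenizing the input text and passing the token embeddings through a BERT encoder, a minimal sketch using the Hugging Face transformers library is shown below. The library choice and checkpoint name are assumptions for illustration, not a claim about the cited model, which adds its own modifications.

# Minimal sketch, assuming the Hugging Face transformers library and the
# bert-base-uncased checkpoint.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A caption describing the image.", return_tensors="pt")
outputs = encoder(**inputs)
token_states = outputs.last_hidden_state  # (1, seq_len, 768) contextual token embeddings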