2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00481

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Abstract: Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to p…

Cited by 6 publications (6 citation statements)
References 30 publications (49 reference statements)
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our …

  Team                           MRR
  OMG (ours)                     0.3012
  Alibaba-UTS-ZJU [1]            0.1869
  SDU-XidianU-SDJZU [38]         0.1613
  SUNYKorea [33]                 0.1594
  Sun Asterisk [30]              0.1571
  HCMUS [31]                     0.1560
  TUE [37]                       0.1548
  JHU-UMD [14]                   0.1364
  Modulabs-Naver-KookminU [15]   0.1195
  Unimore [36]                   0.1078 …”

Section: Evaluation Results (mentioning)
confidence: 99%
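The figures in the comparison above are Mean Reciprocal Rank (MRR) scores, the metric used by the AI City natural-language retrieval track. As a quick, hedged illustration of how such a score is computed (not taken from any of the cited papers; the query and track identifiers below are made up), MRR averages the reciprocal rank of the correct vehicle track over all text queries:

```python
# Minimal Mean Reciprocal Rank (MRR) sketch with toy data.
# `rankings` maps each text query to its ranked list of retrieved track IDs;
# `ground_truth` maps each query to the correct track ID.

def mean_reciprocal_rank(rankings, ground_truth):
    reciprocal_ranks = []
    for query_id, ranked_tracks in rankings.items():
        target = ground_truth[query_id]
        if target in ranked_tracks:
            # rank is the 1-based position of the correct track in the result list
            rank = ranked_tracks.index(target) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)  # correct track not retrieved at all
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy example: correct track ranked 1st, 3rd and 2nd for the three queries.
rankings = {"q1": ["t7", "t2"], "q2": ["t4", "t9", "t1"], "q3": ["t5", "t3"]}
ground_truth = {"q1": "t7", "q2": "t1", "q3": "t3"}
print(mean_reciprocal_rank(rankings, ground_truth))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```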
“…AYCE [36] proposes a modular solution which applies BERT [41] to embed textual descriptions and a CNN [10] with a Transformer model [43] to embed visual information. SBNet [15] presents a substitution module that helps project features from different domains into the same space, and a future prediction module to learn temporal information by predicting the next frame.…”
Section: Text-based Vehicle Retrieval (mentioning)
confidence: 99%
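The AYCE pipeline described in the citation above pairs a BERT embedding of the textual description with CNN-plus-Transformer embeddings of the tracked vehicle's frames, projected into a shared space. The sketch below is a minimal, illustrative approximation of that kind of dual-branch embedder, not the authors' actual code: the ResNet-18 backbone, the 256-dimensional shared space, the two-layer temporal Transformer, and the [CLS] pooling are all assumptions made for the example.

```python
# Hedged sketch of a BERT + CNN + Transformer dual-branch embedder
# (illustrative only; backbone choice and layer sizes are assumptions).
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel

class TextVehicleEmbedder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Text branch: BERT embedding of the natural-language description.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, embed_dim)
        # Visual branch: per-frame CNN features, Transformer over the sequence.
        cnn = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # globally pooled 512-d
        self.frame_proj = nn.Linear(512, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                                   batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, input_ids, attention_mask, frames):
        # frames: (batch, time, 3, H, W) crops of the tracked vehicle
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = self.text_proj(text.last_hidden_state[:, 0])  # [CLS] token

        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (b*t, 512)
        feats = self.frame_proj(feats).view(b, t, -1)        # (b, t, embed_dim)
        vis_emb = self.temporal(feats).mean(dim=1)           # pool over time

        # L2-normalise so a cosine / contrastive loss can align the two spaces.
        return (nn.functional.normalize(text_emb, dim=-1),
                nn.functional.normalize(vis_emb, dim=-1))
```

With two such normalised embeddings, a symmetric contrastive loss over matched description/track pairs is one common way to learn the shared space; the specific loss used by AYCE is not reproduced here.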
“…Before being applied to visual tracking, vision-language fusion models were commonly used in audio-visual speech recognition (AVSR) [36], image retrieval [37], and video question answering [38]. In recent years, transformer-based models have become the preferred architecture for multimodal pretraining because of their excellent capacity for modeling global dependencies [9]. Lu et al. [39] propose ViLBERT, which feeds linguistic and visual features into separate transformer encoders and adopts a co-attention mechanism to fuse the heterogeneous information.…”
Section: Vision-Language Fusion Model (mentioning)
confidence: 99%
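The co-attention idea attributed to ViLBERT above can be made concrete with a small sketch: each modality's tokens attend over the other modality's tokens through cross-attention, with residual connections preserving each stream's own information. This is a simplified single-block approximation under assumed dimensions (256-d features, 4 heads), not the ViLBERT implementation.

```python
# Simplified co-attention block (ViLBERT-style idea, not the original code).
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Each stream queries the other stream: keys/values come from the other modality.
        self.text_attends_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vision_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (batch, n_words, dim); visual_tokens: (batch, n_regions, dim)
        t2v, _ = self.text_attends_vision(text_tokens, visual_tokens, visual_tokens)
        v2t, _ = self.vision_attends_text(visual_tokens, text_tokens, text_tokens)
        # Residual connections keep each modality's original representation.
        return self.norm_t(text_tokens + t2v), self.norm_v(visual_tokens + v2t)

# Usage: fused_text, fused_vision = CoAttentionBlock()(text_feats, region_feats)
```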
“…From a model-based perspective, tracking algorithms have evolved from classical correlation-filter-based models to deep neural networks, thanks to the latter's powerful feature representation [1][2][3][4][5][6]. In the last few years, transformer-based trackers have shown improved performance due to attention mechanisms that enable the modeling of complex feature interactions [7][8][9]. However, existing single-model trackers do not perform as well in practice as they do on publicly available test datasets, especially in challenging scenarios such as viewpoint changes, fast motion, and illumination variation, as shown in Figure 1; here, poor feature representations and model drifting often lead to tracking failures.…”
Section: Introduction (mentioning)
confidence: 99%