2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00472
Towards Accurate Visual and Natural Language-Based Vehicle Retrieval Systems

Cited by 14 publications (8 citation statements).
References 19 publications.
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our OMG achieves the highest MRR of all teams:

Team | MRR
OMG (ours) | 0.3012
Alibaba-UTS-ZJU [1] | 0.1869
SDU-XidianU-SDJZU [38] | 0.1613
SUNYKorea [33] | 0.1594
Sun Asterisk [30] | 0.1571
HCMUS [31] | 0.1560
TUE [37] | 0.1548
JHU-UMD [14] | 0.1364
Modulabs-Naver-KookminU [15] | 0.1195
Unimore [36] | 0.1078…”

Section: Evaluation Results
confidence: 99%
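For context, MRR (mean reciprocal rank) is the metric ranked in the table above: the average over queries of 1/rank of the first correct result. A minimal Python sketch of how it is typically computed follows; the function name and toy inputs are illustrative, not taken from the cited paper.

```python
def mean_reciprocal_rank(ranked_results, ground_truth):
    """Compute MRR: average of 1/rank of the first correct item per query.

    ranked_results: list of lists, each inner list holds candidate IDs
                    ordered by decreasing similarity for one query.
    ground_truth:   list of the correct ID for each query.
    """
    reciprocal_ranks = []
    for candidates, target in zip(ranked_results, ground_truth):
        try:
            rank = candidates.index(target) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)  # target not retrieved at all
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: targets ranked 1st and 4th -> MRR = (1 + 0.25) / 2 = 0.625
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z", "t"]], ["a", "t"]))
```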
“…SBNet [15] presents a substitution module that helps project features from different domains into the same space, and a future-prediction module that learns temporal information by predicting the next frame. Pirazh et al. [14] and Tam et al. [30] adopt CLIP [35] to extract frame features and textual features. TIED [37] proposes an encoder-decoder model in which the encoder embeds the two modalities into a common space and the decoder jointly optimizes these embeddings via an input-token-reconstruction task.…”

Section: Text-based Vehicle Retrieval
confidence: 99%
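As a hedged illustration of the CLIP-based feature extraction the quote describes, here is a minimal sketch using the Hugging Face transformers CLIP API. The checkpoint name and the toy frame/query are assumptions for the sketch, not the cited teams' exact pipelines.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the cited works may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame_000.jpg")  # one sampled video frame (illustrative path)
query = "a blue sedan turning left at an intersection"

inputs = processor(text=[query], images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the frame and the natural-language query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(f"frame-query similarity: {(image_emb @ text_emb.T).item():.3f}")
```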
“…In the 5th NVIDIA AI City Challenge, the majority of teams [2], [16], [17], [18], [19], [20] chose to extract sentence embeddings of the queries, whereas two teams [21], [22] processed the NL queries using conventional NLP techniques. For cross-modality learning, certain teams [20], [2] used ReID models, adopting vision models pre-trained on visual ReID data and language models pre-trained on the given queries from the dataset.…”

Section: Related Work, A. Natural Language-based Vehicle-based Video Retrieval
confidence: 99%
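To make the "sentence embeddings of the queries" step concrete, a minimal sketch with the sentence-transformers library follows. The encoder name and example queries are assumptions; the surveyed teams used a variety of language models.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; individual teams used different language models.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

queries = [
    "a red pickup truck driving straight down the road",
    "a white SUV making a right turn",
]
# One fixed-size, L2-normalized embedding per natural-language query.
embeddings = encoder.encode(queries, convert_to_tensor=True,
                            normalize_embeddings=True)

# Cosine similarity between the two query embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```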
“…3) Deep Feature Extraction: To extract the deep visual features for this dataset, we use ResNet101-IBN-a as the backbone architecture of the EVER model. To train this model, we use the CityFlow-ReID and VehicleX [57] data that are provided. However, as discussed in [52], [58], there is a significant domain shift between this synthetic data and the real data used for evaluation. Therefore, similar to our approach for vehicle detection on the RITIS data […] camera tracking system unless expensive computational units are available.…”

Section: B. AI City Challenge
confidence: 99%
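As a rough sketch of the deep feature extraction described above, the snippet below pulls global embeddings from a plain torchvision ResNet-101. The IBN-a variant in the quote comes from a separate ReID codebase, so a standard ImageNet-pretrained backbone stands in here as an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet101, ResNet101_Weights

# Stand-in backbone: plain ResNet-101; the cited model uses ResNet101-IBN-a.
weights = ResNet101_Weights.DEFAULT
backbone = resnet101(weights=weights)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep pooled features
backbone.eval()

preprocess = weights.transforms()  # matching resize/crop/normalize pipeline

def extract_feature(image):
    """Return an L2-normalized 2048-d embedding for one PIL image."""
    x = preprocess(image).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        feat = backbone(x)              # (1, 2048) pooled global feature
    return F.normalize(feat, dim=-1)
```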