2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2021
DOI: 10.1109/cvprw53098.2021.00467

TIED: A Cycle Consistent Encoder-Decoder Model for Text-to-Image Retrieval

Abstract: Retrieving specific vehicle tracks by Natural Language (NL)-based descriptions is a convenient way to monitor vehicle movement patterns and traffic-related events. NL-based image retrieval has several applications in smart cities, traffic control, etc. In this work, we propose TIED, a text-to-image encoder-decoder model for the simultaneous extraction of visual and textual information for vehicle track retrieval. The model consists of an encoder network that enforces the two modalities into a common latent space…
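The architecture the abstract describes is a fairly standard two-branch design. Below is a minimal PyTorch sketch, not the authors' code: every layer size, module name, and the toy training step are assumptions, illustrating only the shared-latent-space encoder plus the token-reconstruction decoder that the abstract and the citing papers describe.

```python
# Hypothetical sketch of a two-branch encoder mapping image and text into a
# shared latent space, with a decoder that reconstructs the input tokens
# (the cycle-consistency idea). All dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, embed_dim=256, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, latent_dim), nn.ReLU(),
                                      nn.Linear(latent_dim, latent_dim))
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, latent_dim, batch_first=True)

    def forward(self, img_feats, token_ids):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        _, h = self.text_rnn(self.token_embed(token_ids))  # final hidden state summarizes the query
        z_txt = F.normalize(h.squeeze(0), dim=-1)
        return z_img, z_txt

class TokenDecoder(nn.Module):
    """Reconstructs the input tokens from the latent code (cycle consistency)."""
    def __init__(self, vocab_size=10000, latent_dim=256):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def forward(self, z, seq_len):
        # Feed the latent code at every step; predict one token per step.
        inp = z.unsqueeze(1).expand(-1, seq_len, -1).contiguous()
        h, _ = self.rnn(inp)
        return self.out(h)                   # (batch, seq_len, vocab_size)

# Toy training step: contrastive alignment + token reconstruction.
enc, dec = TwoBranchEncoder(), TokenDecoder()
imgs = torch.randn(4, 2048)                  # pre-extracted visual features
toks = torch.randint(0, 10000, (4, 12))      # tokenized NL queries
z_img, z_txt = enc(imgs, toks)
sim = z_img @ z_txt.t() / 0.07               # image-text similarity logits
align_loss = F.cross_entropy(sim, torch.arange(4))
recon_loss = F.cross_entropy(dec(z_txt, 12).reshape(-1, 10000), toks.reshape(-1))
loss = align_loss + recon_loss
```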

Cited by 5 publications (4 citation statements); references 31 publications.
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our OMG achieves the best MRR:

Team                            MRR
OMG (ours)                      0.3012
Alibaba-UTS-ZJU [1]             0.1869
SDU-XidianU-SDJZU [38]          0.1613
SUNYKorea [33]                  0.1594
Sun Asterisk [30]               0.1571
HCMUS [31]                      0.1560
TUE [37]                        0.1548
JHU-UMD [14]                    0.1364
Modulabs-Naver-KookminU [15]    0.1195
Unimore [36]                    0.1078…”

Section: Evaluation Results
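All teams on this leaderboard are ranked by Mean Reciprocal Rank. For reference, a generic MRR computation looks like this (a sketch, not the challenge's official evaluation script):

```python
# Generic Mean Reciprocal Rank (MRR): for each query, take the reciprocal
# of the rank at which the single correct track appears, then average.
def mean_reciprocal_rank(ranked_ids, truth_ids):
    """ranked_ids: one ranked list of candidate ids per query.
    truth_ids: the correct id for each query."""
    total = 0.0
    for ranking, truth in zip(ranked_ids, truth_ids):
        rank = ranking.index(truth) + 1      # 1-based rank of the ground truth
        total += 1.0 / rank
    return total / len(truth_ids)

# Example: ground truth found at ranks 1, 2 and 4 -> (1 + 0.5 + 0.25) / 3
print(mean_reciprocal_rank([[7, 3], [3, 7], [1, 2, 9, 7]], [7, 7, 7]))  # 0.5833...
```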
“…Pirazh et al. [14] and Tam et al. [30] adopt CLIP [35] to extract frame features and textual features. TIED [37] proposes an encoder-decoder based model in which the encoder embeds the two modalities into a common space and the decoder jointly optimizes these embeddings by an input-token-reconstruction task. Tien-Phat et al. [31] adapt COOT [8] to model the cross-modal relationships with both appearance and motion attributes.…”

Section: Text-based Vehicle Retrieval
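Extracting frame and text embeddings with OpenAI's public clip package, as the quoted works do, looks roughly like the sketch below; the model variant, file name, and example query are assumptions, not details from the cited papers.

```python
# Rough sketch of CLIP-based feature extraction for text-to-track retrieval,
# using OpenAI's clip package (pip install git+https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One cropped vehicle frame and one NL query (both placeholders).
image = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a blue sedan turning left at the intersection"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)     # (1, 512) for ViT-B/32
    txt_feat = model.encode_text(text)       # (1, 512)

# Cosine similarity between the frame and the query ranks candidate tracks.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
print((img_feat @ txt_feat.t()).item())
```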
“…In the 5th NVIDIA AI City Challenge, the majority of teams [2], [16], [17], [18], [19], [20] chose to extract sentence embeddings of the queries, whereas two teams [21], [22] processed the NL queries using conventional NLP techniques. For cross-modality learning, certain teams [20], [2] used ReID models, adopting vision models pre-trained on visual ReID data and language models pre-trained on the given queries from the dataset. The motion of vehicles is an integral component of the NL descriptions.…”

Section: Related Work, A. Natural Language-based Vehicle-based Video Retrieval
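A minimal way to obtain sentence embeddings for NL queries is shown below, assuming the sentence-transformers package; the model name is an off-the-shelf default and the queries are placeholders, not the choices of any cited team.

```python
# Sentence embeddings for NL vehicle queries -- a generic sketch using the
# sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
queries = [
    "a red SUV going straight through the intersection",
    "a white pickup truck making a right turn",
]
# Normalized embeddings, so dot products give cosine similarities.
embeddings = model.encode(queries, normalize_embeddings=True)  # shape (2, 384)
print(embeddings.shape)
```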