2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2021
DOI: 10.1109/cvprw53098.2021.00467

TIED: A Cycle Consistent Encoder-Decoder Model for Text-to-Image Retrieval

Abstract: Retrieving specific vehicle tracks by Natural Language (NL)-based descriptions is a convenient way to monitor vehicle movement patterns and traffic-related events. NL-based image retrieval has several applications in smart cities, traffic control, etc. In this work, we propose TIED, a text-to-image encoder-decoder model for the simultaneous extraction of visual and textual information for vehicle track retrieval. The model consists of an encoder network that enforces the two modalities into a common latent space…
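The architecture the abstract describes is a fairly standard two-branch design. Below is a minimal PyTorch sketch, not the authors' code: every layer size, module name, and the toy training step are assumptions, illustrating only the shared-latent-space encoder plus the token-reconstruction decoder that the abstract and the citing papers describe.

```python
# Hypothetical sketch of a two-branch encoder mapping image and text into a
# shared latent space, with a decoder that reconstructs the input tokens
# (the cycle-consistency idea). All dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEncoder(nn.Module):
    def __init__(self, img_dim=2048, vocab_size=10000, embed_dim=256, latent_dim=256):
        super().__init__()
        self.img_proj = nn.Sequential(nn.Linear(img_dim, latent_dim), nn.ReLU(),
                                      nn.Linear(latent_dim, latent_dim))
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_rnn = nn.GRU(embed_dim, latent_dim, batch_first=True)

    def forward(self, img_feats, token_ids):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        _, h = self.text_rnn(self.token_embed(token_ids))  # final hidden state summarizes the query
        z_txt = F.normalize(h.squeeze(0), dim=-1)
        return z_img, z_txt

class TokenDecoder(nn.Module):
    """Reconstructs the input tokens from the latent code (cycle consistency)."""
    def __init__(self, vocab_size=10000, latent_dim=256):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def forward(self, z, seq_len):
        # Feed the latent code at every step; predict one token per step.
        inp = z.unsqueeze(1).expand(-1, seq_len, -1).contiguous()
        h, _ = self.rnn(inp)
        return self.out(h)                   # (batch, seq_len, vocab_size)

# Toy training step: contrastive alignment + token reconstruction.
enc, dec = TwoBranchEncoder(), TokenDecoder()
imgs = torch.randn(4, 2048)                  # pre-extracted visual features
toks = torch.randint(0, 10000, (4, 12))      # tokenized NL queries
z_img, z_txt = enc(imgs, toks)
sim = z_img @ z_txt.t() / 0.07               # image-text similarity logits
align_loss = F.cross_entropy(sim, torch.arange(4))
recon_loss = F.cross_entropy(dec(z_txt, 12).reshape(-1, 10000), toks.reshape(-1))
loss = align_loss + recon_loss
```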

Cited by 5 publications (4 citation statements); references 31 publications.
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our OMG achieves the best MRR:

Team                            MRR
OMG (ours)                      0.3012
Alibaba-UTS-ZJU [1]             0.1869
SDU-XidianU-SDJZU [38]          0.1613
SUNYKorea [33]                  0.1594
Sun Asterisk [30]               0.1571
HCMUS [31]                      0.1560
TUE [37]                        0.1548
JHU-UMD [14]                    0.1364
Modulabs-Naver-KookminU [15]    0.1195
Unimore [36]                    0.1078…”

Section: Evaluation Results
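All teams on this leaderboard are ranked by Mean Reciprocal Rank. For reference, a generic MRR computation looks like this (a sketch, not the challenge's official evaluation script):

```python
# Generic Mean Reciprocal Rank (MRR): for each query, take the reciprocal
# of the rank at which the single correct track appears, then average.
def mean_reciprocal_rank(ranked_ids, truth_ids):
    """ranked_ids: one ranked list of candidate ids per query.
    truth_ids: the correct id for each query."""
    total = 0.0
    for ranking, truth in zip(ranked_ids, truth_ids):
        rank = ranking.index(truth) + 1      # 1-based rank of the ground truth
        total += 1.0 / rank
    return total / len(truth_ids)

# Example: ground truth found at ranks 1, 2 and 4 -> (1 + 0.5 + 0.25) / 3
print(mean_reciprocal_rank([[7, 3], [3, 7], [1, 2, 9, 7]], [7, 7, 7]))  # 0.5833...
```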
“…Pirazh et al. [14] and Tam et al. [30] adopt CLIP [35] to extract frame features and textual features. TIED [37] proposes an encoder-decoder based model in which the encoder embeds the two modalities into a common space and the decoder jointly optimizes these embeddings by an input-token-reconstruction task. Tien-Phat et al. [31] adapt COOT [8] to model the cross-modal relationships with both appearance and motion attributes.…”

Section: Text-based Vehicle Retrieval
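Extracting frame and text embeddings with OpenAI's public clip package, as the quoted works do, looks roughly like the sketch below; the model variant, file name, and example query are assumptions, not details from the cited papers.

```python
# Rough sketch of CLIP-based feature extraction for text-to-track retrieval,
# using OpenAI's clip package (pip install git+https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# One cropped vehicle frame and one NL query (both placeholders).
image = preprocess(Image.open("frame_0001.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a blue sedan turning left at the intersection"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)     # (1, 512) for ViT-B/32
    txt_feat = model.encode_text(text)       # (1, 512)

# Cosine similarity between the frame and the query ranks candidate tracks.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
print((img_feat @ txt_feat.t()).item())
```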
“…In the 5th NVIDIA AI City Challenge, the majority of teams [2], [16], [17], [18], [19], [20] chose to extract sentence embeddings of the queries, whereas two teams [21], [22] processed the NL queries using conventional NLP techniques. For cross-modality learning, certain teams [20], [2] used ReID models, adopting vision models pre-trained on visual ReID data and language models pre-trained on the given queries from the dataset. The motion of vehicles is an integral component of the NL descriptions.…”

Section: Related Work, A. Natural Language-based Vehicle-based Video Retrieval
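A minimal way to obtain sentence embeddings for NL queries is shown below, assuming the sentence-transformers package; the model name is an off-the-shelf default and the queries are placeholders, not the choices of any cited team.

```python
# Sentence embeddings for NL vehicle queries -- a generic sketch using the
# sentence-transformers package (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
queries = [
    "a red SUV going straight through the intersection",
    "a white pickup truck making a right turn",
]
# Normalized embeddings, so dot products give cosine similarities.
embeddings = model.encode(queries, normalize_embeddings=True)  # shape (2, 384)
print(embeddings.shape)
```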