2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00472
Towards Accurate Visual and Natural Language-Based Vehicle Retrieval Systems

Cited by 14 publications (8 citation statements).
References 19 publications.
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our OMG achieves the highest MRR of all teams:

Team | MRR
OMG (ours) | 0.3012
Alibaba-UTS-ZJU [1] | 0.1869
SDU-XidianU-SDJZU [38] | 0.1613
SUNYKorea [33] | 0.1594
Sun Asterisk [30] | 0.1571
HCMUS [31] | 0.1560
TUE [37] | 0.1548
JHU-UMD [14] | 0.1364
Modulabs-Naver-KookminU [15] | 0.1195
Unimore [36] | 0.1078…”

Section: Evaluation Results
confidence: 99%
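For context, MRR (mean reciprocal rank) is the metric ranked in the table above: the average over queries of 1/rank of the first correct result. A minimal Python sketch of how it is typically computed follows; the function name and toy inputs are illustrative, not taken from the cited paper.

```python
def mean_reciprocal_rank(ranked_results, ground_truth):
    """Compute MRR: average of 1/rank of the first correct item per query.

    ranked_results: list of lists, each inner list holds candidate IDs
                    ordered by decreasing similarity for one query.
    ground_truth:   list of the correct ID for each query.
    """
    reciprocal_ranks = []
    for candidates, target in zip(ranked_results, ground_truth):
        try:
            rank = candidates.index(target) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)  # target not retrieved at all
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: targets ranked 1st and 4th -> MRR = (1 + 0.25) / 2 = 0.625
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z", "t"]], ["a", "t"]))
```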
“…SBNet [15] presents a substitution module that helps project features from different domains into the same space, and a future-prediction module that learns temporal information by predicting the next frame. Pirazh et al. [14] and Tam et al. [30] adopt CLIP [35] to extract frame features and textual features. TIED [37] proposes an encoder-decoder model in which the encoder embeds the two modalities into a common space and the decoder jointly optimizes these embeddings via an input-token-reconstruction task.…”

Section: Text-based Vehicle Retrieval
confidence: 99%
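As a hedged illustration of the CLIP-based feature extraction the quote describes, here is a minimal sketch using the Hugging Face transformers CLIP API. The checkpoint name and the toy frame/query are assumptions for the sketch, not the cited teams' exact pipelines.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the cited works may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame_000.jpg")  # one sampled video frame (illustrative path)
query = "a blue sedan turning left at an intersection"

inputs = processor(text=[query], images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the frame and the natural-language query.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(f"frame-query similarity: {(image_emb @ text_emb.T).item():.3f}")
```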
“…In the 5th NVIDIA AI City Challenge, the majority of teams [2], [16], [17], [18], [19], [20] chose to extract sentence embeddings of the queries, whereas two teams [21], [22] processed the NL queries using conventional NLP techniques. For cross-modality learning, certain teams [20], [2] used ReID models, adopting vision models pre-trained on visual ReID data and language models pre-trained on the given queries from the dataset.…”

Section: Related Work, A. Natural Language-based Vehicle-based Video Retrieval
confidence: 99%
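To make the "sentence embeddings of the queries" step concrete, a minimal sketch with the sentence-transformers library follows. The encoder name and example queries are assumptions; the surveyed teams used a variety of language models.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder; individual teams used different language models.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

queries = [
    "a red pickup truck driving straight down the road",
    "a white SUV making a right turn",
]
# One fixed-size, L2-normalized embedding per natural-language query.
embeddings = encoder.encode(queries, convert_to_tensor=True,
                            normalize_embeddings=True)

# Cosine similarity between the two query embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```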
“…3) Deep Feature Extraction: To extract the deep visual features for this dataset, we use ResNet101-IBN-a as the backbone architecture of the EVER model. To train this model, we use the CityFlow-ReID and VehicleX [57] data that are provided. However, as discussed in [52], [58], there is a significant domain shift between this synthetic data and the real data used for evaluation. Therefore, similar to our approach for vehicle detection on the RITIS data […] camera tracking system unless expensive computational units are available.…”

Section: B. AI City Challenge
confidence: 99%
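As a rough sketch of the deep feature extraction described above, the snippet below pulls global embeddings from a plain torchvision ResNet-101. The IBN-a variant in the quote comes from a separate ReID codebase, so a standard ImageNet-pretrained backbone stands in here as an assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet101, ResNet101_Weights

# Stand-in backbone: plain ResNet-101; the cited model uses ResNet101-IBN-a.
weights = ResNet101_Weights.DEFAULT
backbone = resnet101(weights=weights)
backbone.fc = torch.nn.Identity()  # drop the classifier, keep pooled features
backbone.eval()

preprocess = weights.transforms()  # matching resize/crop/normalize pipeline

def extract_feature(image):
    """Return an L2-normalized 2048-d embedding for one PIL image."""
    x = preprocess(image).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        feat = backbone(x)              # (1, 2048) pooled global feature
    return F.normalize(feat, dim=-1)
```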