2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
DOI: 10.1109/cvprw53098.2021.00481

All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers

Abstract: Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to p…

Cited by 6 publications (6 citation statements)
References 30 publications (49 reference statements)
“…We compare our OMG with previous state-of-the-art methods in Table 3. It is shown that our …

  Team                           MRR
  OMG (ours)                     0.3012
  Alibaba-UTS-ZJU [1]            0.1869
  SDU-XidianU-SDJZU [38]         0.1613
  SUNYKorea [33]                 0.1594
  Sun Asterisk [30]              0.1571
  HCMUS [31]                     0.1560
  TUE [37]                       0.1548
  JHU-UMD [14]                   0.1364
  Modulabs-Naver-KookminU [15]   0.1195
  Unimore [36]                   0.1078 …”

Section: Evaluation Results (mentioning)
confidence: 99%
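The figures in the comparison above are Mean Reciprocal Rank (MRR) scores, the metric used by the AI City natural-language retrieval track. As a quick, hedged illustration of how such a score is computed (not taken from any of the cited papers; the query and track identifiers below are made up), MRR averages the reciprocal rank of the correct vehicle track over all text queries:

```python
# Minimal Mean Reciprocal Rank (MRR) sketch with toy data.
# `rankings` maps each text query to its ranked list of retrieved track IDs;
# `ground_truth` maps each query to the correct track ID.

def mean_reciprocal_rank(rankings, ground_truth):
    reciprocal_ranks = []
    for query_id, ranked_tracks in rankings.items():
        target = ground_truth[query_id]
        if target in ranked_tracks:
            # rank is the 1-based position of the correct track in the result list
            rank = ranked_tracks.index(target) + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)  # correct track not retrieved at all
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy example: correct track ranked 1st, 3rd and 2nd for the three queries.
rankings = {"q1": ["t7", "t2"], "q2": ["t4", "t9", "t1"], "q3": ["t5", "t3"]}
ground_truth = {"q1": "t7", "q2": "t1", "q3": "t3"}
print(mean_reciprocal_rank(rankings, ground_truth))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```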
“…AYCE [36] proposes a modular solution which applies BERT [41] to embed textual descriptions and a CNN [10] with a Transformer model [43] to embed visual information. SBNet [15] presents a substitution module that helps project features from different domains into the same space, and a future prediction module to learn temporal information by predicting the next frame.…”
Section: Text-based Vehicle Retrieval (mentioning)
confidence: 99%
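The AYCE pipeline described in the citation above pairs a BERT embedding of the textual description with CNN-plus-Transformer embeddings of the tracked vehicle's frames, projected into a shared space. The sketch below is a minimal, illustrative approximation of that kind of dual-branch embedder, not the authors' actual code: the ResNet-18 backbone, the 256-dimensional shared space, the two-layer temporal Transformer, and the [CLS] pooling are all assumptions made for the example.

```python
# Hedged sketch of a BERT + CNN + Transformer dual-branch embedder
# (illustrative only; backbone choice and layer sizes are assumptions).
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import BertModel

class TextVehicleEmbedder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Text branch: BERT embedding of the natural-language description.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, embed_dim)
        # Visual branch: per-frame CNN features, Transformer over the sequence.
        cnn = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # globally pooled 512-d
        self.frame_proj = nn.Linear(512, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                                   batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, input_ids, attention_mask, frames):
        # frames: (batch, time, 3, H, W) crops of the tracked vehicle
        text = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = self.text_proj(text.last_hidden_state[:, 0])  # [CLS] token

        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)   # (b*t, 512)
        feats = self.frame_proj(feats).view(b, t, -1)        # (b, t, embed_dim)
        vis_emb = self.temporal(feats).mean(dim=1)           # pool over time

        # L2-normalise so a cosine / contrastive loss can align the two spaces.
        return (nn.functional.normalize(text_emb, dim=-1),
                nn.functional.normalize(vis_emb, dim=-1))
```

With two such normalised embeddings, a symmetric contrastive loss over matched description/track pairs is one common way to learn the shared space; the specific loss used by AYCE is not reproduced here.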
“…Before being applied to visual tracking, vision-language fusion models were commonly used in audio-visual speech recognition (AVSR) [36], image retrieval [37], and video question answering [38]. In recent years, transformer-based models have become the preferred architecture for multimodal pretraining because of their excellent capacity for modeling global dependencies [9]. Lu et al. [39] propose ViLBERT, which feeds linguistic and visual features into separate transformer encoders and adopts a co-attention mechanism to fuse the heterogeneous information.…”
Section: Vision-Language Fusion Model (mentioning)
confidence: 99%
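The co-attention idea attributed to ViLBERT above can be made concrete with a small sketch: each modality's tokens attend over the other modality's tokens through cross-attention, with residual connections preserving each stream's own information. This is a simplified single-block approximation under assumed dimensions (256-d features, 4 heads), not the ViLBERT implementation.

```python
# Simplified co-attention block (ViLBERT-style idea, not the original code).
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Each stream queries the other stream: keys/values come from the other modality.
        self.text_attends_vision = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vision_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (batch, n_words, dim); visual_tokens: (batch, n_regions, dim)
        t2v, _ = self.text_attends_vision(text_tokens, visual_tokens, visual_tokens)
        v2t, _ = self.vision_attends_text(visual_tokens, text_tokens, text_tokens)
        # Residual connections keep each modality's original representation.
        return self.norm_t(text_tokens + t2v), self.norm_v(visual_tokens + v2t)

# Usage: fused_text, fused_vision = CoAttentionBlock()(text_feats, region_feats)
```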
“…From a model-based perspective, tracking algorithms have evolved from classical correlation-filter-based models to deep neural networks, thanks to the latter's powerful feature representation [1][2][3][4][5][6]. In the last few years, transformer-based trackers have shown improved performance due to attention mechanisms that enable the modeling of complex feature interactions [7][8][9]. However, existing single-model trackers do not perform as well in practice as they do on publicly available test datasets, especially in challenging scenarios such as viewpoint changes, fast motion, and illumination variation, as shown in Figure 1; here, poor feature representations and model drifting often lead to tracking failures.…”
Section: Introduction (mentioning)
confidence: 99%