Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475349

Capsule-based Object Tracking with Natural Language Specification

Abstract: Tracking with Natural-Language Specification (TNL) is a joint topic of understanding the vision and natural language with a wide range of applications. In previous works, the communication between two heterogeneous features of vision and language is mainly through a simple dynamic convolution. However, the performance of prior works is capped by the difficulty of linguistic variation of natural language in modeling the dynamically changing target and its surroundings. In the meanwhile, natural language and vis…

Cited by 6 publications (3 citation statements)
References 45 publications (73 reference statements)

“…In recent years, the two-stream framework [2], [4], [5], [8] has emerged as a dominant VL tracking paradigm (see Fig. 1(a)).…”
Section: A Vision-language Tracking (mentioning)
confidence: 99%
“…In the past few years, two-stream VL trackers [2], [4], [5], [8], which extract visual features and language features separately and then perform feature interaction in a fusion model (as shown in Fig 1(a)), have emerged as a domain framework and obtained significant progresses. For instance, Feng et al [4] proposed a Siamese natural language region proposal network for multi-stage feature extraction, and then applied an aggregation module to dynamically combine predictions from both visual and language modalities.…”
Section: Introduction (mentioning)
confidence: 99%