2022
DOI: 10.1007/978-3-031-20074-8_15
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context

Cited by 27 publications (9 citation statements)
References 44 publications
“…Specifically, sketch-based video summarization aims to automatically generate storyboard sketches from video clips, which provides an interactive representation for annotating and visualizing the major scene content of video clips [42] and supports flexibly editing or adding object sketches in a sketch-based interface. Furthermore, inspired by [44], [32], we will try CLIP [43] to simplify our SQ-GCN model and adapt it to scene-sketch feature encoding in the future.…”
Section: Discussion (mentioning)
confidence: 99%
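The CLIP adaptation described in that statement is future work, so no reference implementation exists; the sketch below only illustrates what encoding a scene sketch with CLIP's image encoder could look like. It assumes the Hugging Face transformers CLIP API; the checkpoint name and the sketch.png path are placeholders, not part of the cited work.

```python
# Minimal sketch: encoding a scene sketch with CLIP's image encoder.
# The checkpoint and "sketch.png" are illustrative placeholders, not
# the cited authors' actual pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

sketch = Image.open("sketch.png").convert("RGB")
inputs = processor(images=sketch, return_tensors="pt")

with torch.no_grad():
    feats = model.get_image_features(**inputs)       # (1, 512) embedding
feats = feats / feats.norm(dim=-1, keepdim=True)     # L2-normalize for cosine similarity
```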
“…Although TU-Berlin [9] and Sketchy [10] have a large number of sketches and object categories, they cannot enable fine-grained instance-level retrieval due to the lack of instance-level matches. Most of the remaining datasets in Table I support the fine-grained cross-modal retrieval task, among which SketchyScene [12], SketchyCOCO [31] and FS-COCO [32] are capable of fine-grained scene-level retrieval with multiple instances, yet they are all limited to the image domain. Compared with the video retrieval datasets TSF [3] and FG-SBVR [8], our dataset covers more object categories and contains more sketches, and its sketches depict not only fine-grained single instances but also multiple objects in diverse scenes, making it more suitable for real-world sketch-related video research.…”
Section: E. Dataset Analysis (mentioning)
confidence: 99%
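To make "fine-grained instance-level retrieval" concrete: each query sketch has exactly one paired photo (or video), and a model is scored on whether that specific pair is ranked highly, not merely an item of the same category. A minimal sketch of the usual top-k accuracy computation follows; the function name and the assumption of row-aligned, L2-normalized embeddings are illustrative, not taken from any cited dataset's protocol.

```python
import torch

def topk_accuracy(sketch_emb: torch.Tensor, photo_emb: torch.Tensor, k: int = 10) -> float:
    """Fraction of sketches whose true paired photo appears in the top k.
    Assumes row i of each matrix is the same instance, and that both
    matrices are L2-normalized with shape (N, D)."""
    sims = sketch_emb @ photo_emb.T                        # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                     # k best photo indices per sketch
    targets = torch.arange(sketch_emb.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```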
“…One trend that has emerged in the field is the use of attention mechanisms [36, 37, 38, 39, 40], which incorporate both global and local visual features into image captioning. Another trend in the field of image captioning focuses on fine-grained details and object descriptions [41, 42, 43]. Transformer models have also proven to be effective in several recent studies.…”
Section: Related Work (mentioning)
confidence: 99%
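As a concrete illustration of the first trend, the sketch below shows one scaled dot-product attention step in which a caption decoder state attends jointly over local region features and a global image feature; all names and shapes are hypothetical, and this is not the exact mechanism of any cited paper.

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, region_feats, global_feat):
    """One attention step over local region features plus a global feature.
    decoder_state: (B, D); region_feats: (B, R, D); global_feat: (B, D).
    Returns an attended context vector of shape (B, D)."""
    # Treat the global feature as one extra "region" so attention can
    # weigh global context against local details.
    keys = torch.cat([region_feats, global_feat.unsqueeze(1)], dim=1)  # (B, R+1, D)
    scores = (keys @ decoder_state.unsqueeze(-1)).squeeze(-1)          # (B, R+1)
    scores = scores / keys.size(-1) ** 0.5                             # scaled dot product
    weights = F.softmax(scores, dim=-1)                                # attention weights
    return (weights.unsqueeze(-1) * keys).sum(dim=1)                   # (B, D) context
```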
“…Nonetheless, the majority of these works [15, 53] are restricted to using edgemaps as a pseudo sketch-replacement for model training. However, a free-hand sketch [62], with human-drawn sparse and abstract strokes, is a way of conveying the "semantic intent", and differs largely [58] from an edgemap. While an edgemap aligns perfectly with photo boundaries, a sketch is a human abstraction of an object or concept, usually with strong deformations [58].…”
Section: Sketch-to-photo Generation (mentioning)
confidence: 99%
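The edgemap/sketch contrast can be made concrete: the pseudo sketches those works train on are typically produced by running an edge detector on the photo, so they trace the photo's boundaries exactly. The sketch below uses OpenCV's Canny detector as a stand-in for whichever detector a given paper used; the file paths and thresholds are placeholders.

```python
# Minimal sketch: the kind of edgemap prior works use as a pseudo sketch.
# Canny stands in for whatever edge detector a given paper used;
# "photo.jpg" and "edgemap.png" are placeholder paths.
import cv2

gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
gray = cv2.GaussianBlur(gray, (5, 5), 0)          # denoise before edge detection
edges = cv2.Canny(gray, threshold1=100, threshold2=200)
cv2.imwrite("edgemap.png", 255 - edges)           # invert: dark strokes on white
```

Unlike these machine-traced boundaries, a freehand sketch abstracts and deforms the depicted object, which is the gap the quoted passage highlights.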