Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology 2020
DOI: 10.1145/3379337.3415882

Crosscast: Adding Visuals to Audio Travel Podcasts

Abstract: Figure 1. Given audio travel podcasts and their transcripts (e.g., travel to Tokyo and Sydney), Crosscast automatically selects the most relevant locations and visual entities at any moment of a podcast, then queries and displays images to accompany the audio, enabling an audiovisual travel storytelling experience.

Cited by 33 publications (14 citation statements)
References 27 publications
“…It segments recordings into pieces for each line of dialogue. [18] obtains transcripts from audio using rev.com and spots locations and visually significant entities (VSEs) using Google NLP toolkit. [75] uses titles as import cues to summarize videos.…”
Section: Text (mentioning)
confidence: 99%
“…For example, automatic speech recognition techniques cannot meet the expectation of professional editors. They ususally require a perfect transcript from video providers or crowdsource [18][19] [20]. The corelations among multi-modality is not well investigated, though some researches have attempted [21].…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, automatic speech recognition techniques cannot meet the expectation of professional editors. They usually require a perfect transcript from video providers or crowdsource [18][19] [20]. The correlations among multi-modality are not well investigated, though some researchers have attempted [21].…”
Section: Introduction (mentioning)
confidence: 99%
“…Crosscast utilized heuristic-based algorithms to extract relevant information from audio transcripts for travel podcasts, compose search queries, and retrieve relevant visual content to augment audio travel podcast [56]. One limitation of using automatically generated content is that the visual styles of content are limited.…”
Section: Visual Content Generation From Natural Language (mentioning)
confidence: 99%