“…We build our model on top of a pre-trained sequence-to-sequence architecture (i.e., BART; Lewis et al., 2020) fine-tuned on summarization and capable of generating fluent long text. We convert its textual encoder to a multimodal one by adding and tuning adapter layers (Rebuffi et al., 2017; Houlsby et al., 2019),

|                | Modality            | Input | Output | Datasets |
|----------------|---------------------|-------|--------|----------|
| text-to-text   | text                | short | short  | XSum (Narayan et al., 2018), CNN-DailyMail (Nallapati et al., 2016), NYT (Durrett et al., 2016), Gigaword (Napoles et al., 2012) |
| text-to-text   | text                | long  | long   | SamSum (Gliwa et al., 2019), QMSum (Zhong et al., 2021), SummScreen |
| video-to-video | vision              | short | short  | OVP (De Avila et al., 2011), YouTube (De Avila et al., 2011), SumMe (Gygli et al., 2014) |
| video-to-video | vision/text         | short | short  | TVSum (Song et al., 2015) |
| video-to-video | vision/text(/audio) | long  | long   | LoL (Fu et al., 2017), TRIPOD+ (Papalampidi et al., 2021b) |
| video-to-text  | vision              | long  | short  | TACoS (Rohrbach et al., 2014) |
| video-to-text  | vision/text/audio   | short | short  | How2 (Sanabria et al., 2018) |
| video-to-text  | vision/text/audio   | long  | long   | SummScreen 3D |

…”
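The adapter-based setup described in the excerpt can be illustrated with a short sketch. The code below is a minimal, hypothetical example, not the authors' implementation: it attaches bottleneck adapters (Houlsby et al., 2019) to the encoder layers of a summarization-fine-tuned BART checkpoint while keeping the pre-trained weights frozen. The checkpoint name (`facebook/bart-large-cnn`), the adapter bottleneck size, and the hook-based insertion point are assumptions made for illustration; the multimodal components the excerpt refers to are not shown.

```python
# Minimal sketch (assumptions noted above), not the paper's released code.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration


class BottleneckAdapter(nn.Module):
    """Down-project -> non-linearity -> up-project, with a residual connection."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))


def add_encoder_adapters(model: BartForConditionalGeneration, bottleneck: int = 64):
    """Freeze the pre-trained model and attach one adapter per encoder layer."""
    for p in model.parameters():
        p.requires_grad = False

    adapters = nn.ModuleList()
    hidden_size = model.config.d_model
    for layer in model.model.encoder.layers:
        adapter = BottleneckAdapter(hidden_size, bottleneck)
        adapters.append(adapter)

        def hook(module, inputs, output, adapter=adapter):
            # Encoder layers typically return a tuple whose first element is
            # the hidden states; run them through the adapter.
            if isinstance(output, tuple):
                return (adapter(output[0]),) + output[1:]
            return adapter(output)

        layer.register_forward_hook(hook)

    return adapters  # the only trainable parameters


if __name__ == "__main__":
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    adapters = add_encoder_adapters(model, bottleneck=64)
    trainable = sum(p.numel() for p in adapters.parameters())
    print(f"trainable adapter parameters: {trainable:,}")
```

Forward hooks are used here only to avoid modifying the library's layer classes; one could equally subclass the encoder layer or use a parameter-efficient fine-tuning library to the same effect.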