Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.470

Long-Span Summarization via Local Attention and Content Selection

Abstract: Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pretrained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we …
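The scaling issue the abstract describes can be made concrete with a short sketch. The snippet below is an illustration rather than the authors' code: it counts the query-key pairs scored by full self-attention versus a windowed local mask, with the window value 1024 chosen only to mirror the attention width reported by a citing paper later on this page.

```python
# Minimal sketch (not the paper's implementation): full vs. windowed local attention.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where position i may attend to position j, i.e. |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

for n in (1024, 4096, 8192):
    full = n * n                                        # score entries with full self-attention
    local = int(local_attention_mask(n, 1024).sum())    # score entries with a width-1024 window
    print(f"n={n:6d}  full={full:>12,}  local={local:>12,}")
```

The full count grows quadratically with the sequence length, while the windowed count grows only linearly, which is the motivation for local attention in long-document summarization.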

Cited by 22 publications (37 citation statements). References 33 publications.

“…Compared with sequence, graph can aggregate relevant disjoint context by uniformly representing them as nodes and their relations as edges [136,191]. As a representative example, Wu et al. [191] …”
[Remainder of this statement is residue from the citing survey's Fig. 2, which lists DANCER [53], Manakul et al. [130], and, under Preservation, CSP [203], mRASP [112], and Wada et al. [181].]
Section: Paragraph Representation Learning (mentioning)
confidence: 99%
“…Efficiency is an important factor to consider for modeling long documents, especially when generating long text. Since the cost of the self-attention mechanism grows quadratically with sequence length, many works aim to improve the encoding efficiency of self-attention [76,130]. A representative example is Manakul et al. [130], who proposed two methods: local self-attention, allowing longer input spans during training; and explicit content selection, reducing memory and compute requirements.…”
Section: Document Representation Learning (mentioning)
confidence: 99%
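The second idea in the quote above, explicit content selection, amounts to keeping only the most salient parts of a long input so that it fits the encoder's budget. The sketch below is a hedged illustration of that general idea, not the selection model used in the paper: the salience scorer (word overlap with the lead sentence) and the token budget are stand-ins.

```python
# Hedged sketch of "content selection": rank input sentences with a salience
# score and keep only as many as fit the encoder's input budget.
from typing import Callable, List

def select_content(sentences: List[str],
                   score: Callable[[str], float],
                   budget_tokens: int) -> List[str]:
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n_tok = len(sentences[i].split())
        if used + n_tok > budget_tokens:
            continue
        kept.add(i)
        used += n_tok
    # Preserve original sentence order so the selected text stays coherent.
    return [sentences[i] for i in sorted(kept)]

doc = ["Transformers are strong summarizers.",
       "Their attention cost grows quadratically with input length.",
       "Local attention and content selection reduce that cost.",
       "Unrelated filler sentence about the weather."]
lead_words = set(doc[0].lower().split())
overlap = lambda s: len(lead_words & set(s.lower().split()))
print(select_content(doc, overlap, budget_tokens=20))
```

Whatever scorer is used, the key property is that the selected sentences are re-ordered back into document order before being fed to the summarizer.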
“…Vanilla Transformers. We use BART (Lewis et al., 2020) and local-attention BART (LoBART) (Manakul and Gales, 2021) as our base models. BART's maximum input length is 1024, while that of LoBART is 4096 with an attention width of 1024.…”
Section: Models and Data (mentioning)
confidence: 99%
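The 1024-token limit mentioned in the quote above is easy to observe in practice. The sketch below assumes the Hugging Face transformers library and the public facebook/bart-large-cnn checkpoint (downloaded on first use); LoBART checkpoints are distributed separately by the authors, so only plain BART's limit is exercised here.

```python
# Sketch of BART's 1024-token input limit, using Hugging Face transformers.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
long_document = "A very long document. " * 2000   # far beyond 1024 tokens

enc = tokenizer(long_document, truncation=True, max_length=1024, return_tensors="pt")
print(enc["input_ids"].shape)   # torch.Size([1, 1024]) -- everything past 1024 tokens is dropped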
“…Time and memory are dominated by the encoder self-attention, and models such as LoBART adopt local attention in their encoders to mitigate this bottleneck, while keeping the original decoder (Manakul and Gales, 2021). Training is fast because attention is highly parallelizable.…”
Section: Attention in the Transformer (mentioning)
confidence: 99%
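The division of labour described above, local attention only in the encoder with the decoder left unchanged, can be sketched with PyTorch's scaled_dot_product_attention. This is an illustration with toy tensor sizes and window width, not LoBART's actual implementation.

```python
# Hedged sketch: banded (local) self-attention in the encoder, ordinary full
# attention kept for decoder-side cross-attention. Toy sizes, PyTorch >= 2.0.
import torch
import torch.nn.functional as F

def banded_mask(seq_len: int, width: int) -> torch.Tensor:
    """Boolean mask: True where |i - j| <= width // 2, i.e. inside the local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= width // 2

n_enc, n_dec, heads, d = 2048, 128, 2, 64
q_e = k_e = v_e = torch.randn(1, heads, n_enc, d)   # encoder states
q_d = torch.randn(1, heads, n_dec, d)               # decoder states

# Encoder self-attention restricted to a local window of width 512.
enc_out = F.scaled_dot_product_attention(q_e, k_e, v_e, attn_mask=banded_mask(n_enc, 512))

# Decoder cross-attention over the full encoder output, as in vanilla BART.
dec_out = F.scaled_dot_product_attention(q_d, k_e, v_e)

print(enc_out.shape, dec_out.shape)   # (1, 2, 2048, 64) (1, 2, 128, 64)
```

Cross-attention cost scales with the product of the decoder and encoder lengths, which stays modest for summarization because summaries are short; the encoder self-attention is the part that needs the local mask.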