Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.470

Long-Span Summarization via Local Attention and Content Selection

Abstract: Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pretrained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we …
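The scaling issue the abstract describes can be made concrete with a short sketch. The snippet below is an illustration rather than the authors' code: it counts the query-key pairs scored by full self-attention versus a windowed local mask, with the window value 1024 chosen only to mirror the attention width reported by a citing paper later on this page.

```python
# Minimal sketch (not the paper's implementation): full vs. windowed local attention.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where position i may attend to position j, i.e. |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

for n in (1024, 4096, 8192):
    full = n * n                                        # score entries with full self-attention
    local = int(local_attention_mask(n, 1024).sum())    # score entries with a width-1024 window
    print(f"n={n:6d}  full={full:>12,}  local={local:>12,}")
```

The full count grows quadratically with the sequence length, while the windowed count grows only linearly, which is the motivation for local attention in long-document summarization.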

Cited by 22 publications (37 citation statements). References 33 publications.

“…Compared with sequence, graph can aggregate relevant disjoint context by uniformly representing them as nodes and their relations as edges [136,191]. As a representative example, Wu et al. [191] …”
[Remainder of this statement is residue from the citing survey's Fig. 2, which lists DANCER [53], Manakul et al. [130], and, under Preservation, CSP [203], mRASP [112], and Wada et al. [181].]
Section: Paragraph Representation Learning (mentioning)
confidence: 99%
“…Efficiency is an important factor to consider for modeling long documents, especially when generating long text. Since the cost of the self-attention mechanism grows quadratically with sequence length, many works aim to improve the encoding efficiency of self-attention [76,130]. A representative example is Manakul et al. [130], who proposed two methods: local self-attention, allowing longer input spans during training; and explicit content selection, reducing memory and compute requirements.…”
Section: Document Representation Learning (mentioning)
confidence: 99%
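The second idea in the quote above, explicit content selection, amounts to keeping only the most salient parts of a long input so that it fits the encoder's budget. The sketch below is a hedged illustration of that general idea, not the selection model used in the paper: the salience scorer (word overlap with the lead sentence) and the token budget are stand-ins.

```python
# Hedged sketch of "content selection": rank input sentences with a salience
# score and keep only as many as fit the encoder's input budget.
from typing import Callable, List

def select_content(sentences: List[str],
                   score: Callable[[str], float],
                   budget_tokens: int) -> List[str]:
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n_tok = len(sentences[i].split())
        if used + n_tok > budget_tokens:
            continue
        kept.add(i)
        used += n_tok
    # Preserve original sentence order so the selected text stays coherent.
    return [sentences[i] for i in sorted(kept)]

doc = ["Transformers are strong summarizers.",
       "Their attention cost grows quadratically with input length.",
       "Local attention and content selection reduce that cost.",
       "Unrelated filler sentence about the weather."]
lead_words = set(doc[0].lower().split())
overlap = lambda s: len(lead_words & set(s.lower().split()))
print(select_content(doc, overlap, budget_tokens=20))
```

Whatever scorer is used, the key property is that the selected sentences are re-ordered back into document order before being fed to the summarizer.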
“…Vanilla Transformers. We use BART (Lewis et al., 2020) and local-attention BART (LoBART) (Manakul and Gales, 2021) as our base models. BART's maximum input length is 1024, while that of LoBART is 4096 with an attention width of 1024.…”
Section: Models and Data (mentioning)
confidence: 99%
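The 1024-token limit mentioned in the quote above is easy to observe in practice. The sketch below assumes the Hugging Face transformers library and the public facebook/bart-large-cnn checkpoint (downloaded on first use); LoBART checkpoints are distributed separately by the authors, so only plain BART's limit is exercised here.

```python
# Sketch of BART's 1024-token input limit, using Hugging Face transformers.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
long_document = "A very long document. " * 2000   # far beyond 1024 tokens

enc = tokenizer(long_document, truncation=True, max_length=1024, return_tensors="pt")
print(enc["input_ids"].shape)   # torch.Size([1, 1024]) -- everything past 1024 tokens is dropped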
“…Time and memory are dominated by the encoder self-attention, and models such as LoBART adopt local attention in their encoders to mitigate this bottleneck, while keeping the original decoder (Manakul and Gales, 2021). Training is fast because attention is highly parallelizable.…”
Section: Attention in the Transformer (mentioning)
confidence: 99%
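The division of labour described above, local attention only in the encoder with the decoder left unchanged, can be sketched with PyTorch's scaled_dot_product_attention. This is an illustration with toy tensor sizes and window width, not LoBART's actual implementation.

```python
# Hedged sketch: banded (local) self-attention in the encoder, ordinary full
# attention kept for decoder-side cross-attention. Toy sizes, PyTorch >= 2.0.
import torch
import torch.nn.functional as F

def banded_mask(seq_len: int, width: int) -> torch.Tensor:
    """Boolean mask: True where |i - j| <= width // 2, i.e. inside the local window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= width // 2

n_enc, n_dec, heads, d = 2048, 128, 2, 64
q_e = k_e = v_e = torch.randn(1, heads, n_enc, d)   # encoder states
q_d = torch.randn(1, heads, n_dec, d)               # decoder states

# Encoder self-attention restricted to a local window of width 512.
enc_out = F.scaled_dot_product_attention(q_e, k_e, v_e, attn_mask=banded_mask(n_enc, 512))

# Decoder cross-attention over the full encoder output, as in vanilla BART.
dec_out = F.scaled_dot_product_attention(q_d, k_e, v_e)

print(enc_out.shape, dec_out.shape)   # (1, 2, 2048, 64) (1, 2, 128, 64)
```

Cross-attention cost scales with the product of the decoder and encoder lengths, which stays modest for summarization because summaries are short; the encoder self-attention is the part that needs the local mask.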