“…The ability to process long sequences is critical for many Natural Language Processing tasks, including Document Summarization (Xiao and Carenini, 2019; Huang et al., 2021), Question Answering (Wang et al., 2020a), Information Extraction (Du and Cardie, 2020; Ebner et al., 2020; Du et al., 2022), and Machine Translation (Bao et al., 2021). However, the quadratic computational cost of self-attention in transformer-based models limits their application to long-sequence tasks.…”
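The quadratic cost mentioned above comes from the attention score matrix, which pairs every token with every other token. A minimal NumPy sketch (illustrative only, not from the quoted work) of single-head self-attention makes the (n, n) term explicit; the identity Q/K/V projections are a simplifying assumption:

```python
import numpy as np

def self_attention(X):
    """Naive single-head self-attention over a length-n sequence.

    The score matrix Q @ K.T has shape (n, n), so both time and
    memory grow quadratically with the sequence length n.
    """
    n, d = X.shape
    Q, K, V = X, X, X                          # identity projections for simplicity
    scores = Q @ K.T / np.sqrt(d)              # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
for n in (128, 256):
    X = rng.standard_normal((n, 64))
    out = self_attention(X)
    # Doubling n quadruples the number of entries in the score matrix.
    print(n, out.shape, n * n)
```

Doubling the sequence length quadruples the score-matrix size, which is why long-document tasks such as those listed above strain standard transformers.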