The problem of generating long audio token sequences can be addressed by at least three orthogonal approaches, or a combination thereof: i) efficient attention mechanisms (Kitaev et al., 2020; Choromanski et al., 2021; Xiong et al., 2021; Hawthorne et al., 2022), ii) non-autoregressive, parallel decoding schemes (Gu et al., 2017; Ghazvininejad et al., 2019; Chang et al., 2022), iii) custom architectures adapted to the special structure of the tokens produced by neural audio codecs (Kreuk et al., 2022; Wang et al., 2023; Lee et al., 2022). However, in the context of modeling the token sequences of neural audio codecs, either unconditionally or based on weak conditioning such as text, the efficient generation of long, high-quality audio segments remains an open problem.
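To make approach ii) concrete, the sketch below illustrates MaskGIT-style iterative parallel decoding (in the spirit of Chang et al., 2022): start from a fully masked token sequence, predict all positions in parallel, commit the most confident predictions, and re-mask the rest according to a cosine schedule. The `toy_model`, vocabulary size, and sequence length are placeholder assumptions standing in for a trained bidirectional transformer over codec tokens; this is not the method of any specific cited paper.

```python
import numpy as np

MASK = -1      # sentinel for a masked position (assumption for illustration)
VOCAB = 16     # toy codec vocabulary size
SEQ_LEN = 32   # toy sequence length
STEPS = 8      # number of parallel refinement steps

rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for a trained model: per-position logits over the vocabulary.
    In practice this would be a bidirectional transformer over codec tokens."""
    return rng.normal(size=(len(tokens), VOCAB))

def parallel_decode(seq_len=SEQ_LEN, steps=STEPS):
    tokens = np.full(seq_len, MASK)
    for t in range(steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        preds = probs.argmax(-1)              # parallel prediction for every position
        conf = probs.max(-1)
        masked = tokens == MASK
        # cosine schedule: how many positions remain masked after this step
        keep_masked = int(seq_len * np.cos(np.pi / 2 * (t + 1) / steps))
        conf = np.where(masked, conf, np.inf)  # already-committed tokens are never re-masked
        order = np.argsort(conf)               # ascending: least confident first
        tokens = np.where(masked, preds, tokens)
        tokens[order[:keep_masked]] = MASK     # re-mask the least confident predictions
    return tokens
```

Because the schedule reaches zero masked positions at the final step, the whole sequence is produced in `STEPS` forward passes rather than `SEQ_LEN` autoregressive steps, which is the source of the speedup this family of methods offers.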