“…The Transformer allows a token's attention to be spread over the entire input sequence multiple times, each pass intuitively capturing different properties. This characteristic has led to a line of research on interpreting Transformer-based networks and their attention mechanisms (Raganato and Tiedemann, 2018; Tang et al., 2018; Mareček and Rosa, 2019; Voita et al., 2019a; Vig and Belinkov, 2019; Clark et al., 2019; Kovaleva et al., 2019; Tenney et al., 2019; Lin et al., 2019; Jawahar et al., 2019; van Schijndel et al., 2019; Hao et al., 2019b; Rogers et al., 2020).…”
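To make the claim concrete, the following is a minimal NumPy sketch of multi-head self-attention, not drawn from any of the cited works: every dimension, weight, and name below is an illustrative assumption. It shows how a single token obtains several independent attention distributions over the whole sequence, one per head, which is the property the interpretability studies above examine.

```python
# Minimal sketch of multi-head self-attention (illustrative only).
# All sizes and random weights are assumptions, not from the cited papers.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
d_head = d_model // n_heads

# Placeholder token representations for a 6-token sequence.
x = rng.standard_normal((seq_len, d_model))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One independent query/key projection per head.
W_q = rng.standard_normal((n_heads, d_model, d_head))
W_k = rng.standard_normal((n_heads, d_model, d_head))

for h in range(n_heads):
    q = x @ W_q[h]                              # (seq_len, d_head)
    k = x @ W_k[h]                              # (seq_len, d_head)
    attn = softmax(q @ k.T / np.sqrt(d_head))   # (seq_len, seq_len)
    # Row 0 is how token 0 spreads its attention over every
    # position in the sequence under head h.
    print(f"head {h}, token 0 attention:", np.round(attn[0], 3))
```

Because each head has its own projections, the printed rows differ per head: the same token attends over the full sequence several times, and interpretability work asks what linguistic or structural property, if any, each such distribution tracks.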