“…However, as its performance in terms of test likelihood and the quality of generated samples fell short of what was desired, many modifications were proposed to improve its performance on high-dimensional data such as natural images. In general, one can obtain a tighter lower bound, and thus a more powerful and flexible model, by improving any of the following three elements: the encoder (Rezende et al., 2014; van den Berg et al., 2018; Hoogeboom et al., 2020; Maaløe et al., 2016), the prior (or marginal over latents) (Chen et al., 2016; Habibian et al., 2019; Lavda et al., 2020; Lin & Clark, 2020; Tomczak & Welling, 2017), and the decoder (Gulrajani et al., 2016). Nevertheless, recent studies have shown that, by employing deep hierarchical architectures and by carefully designing the building blocks of the neural networks, VAEs can successfully model large high-dimensional data and reach state-of-the-art test likelihoods (Zhao et al., 2017; Maaløe et al., 2019; Vahdat & Kautz, 2020).…”
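For context, the lower bound referred to above is the standard evidence lower bound (ELBO); a minimal sketch in conventional VAE notation (the symbols $\theta$, $\phi$, $z$ are not defined in the excerpt and are assumed here) shows where each of the three components enters:

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
\;=\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{decoder term}}
\;-\; \underbrace{\mathrm{KL}\!\big(q_\phi(z\mid x)\,\big\|\,p(z)\big)}_{\text{encoder vs.\ prior}}
```

A more expressive encoder $q_\phi(z\mid x)$ tightens the bound by shrinking the gap $\mathrm{KL}(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x))$, while richer priors $p(z)$ and decoders $p_\theta(x\mid z)$ increase the likelihood the model can represent, which motivates the three lines of work cited above.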