“…In addition, these works struggle in a multi-view scenario where pre-segmented images require consistent multi-frame object registration and tracking, since the segmentation and representation models work independently. More recently, several works [4,9,18,17,1] have succeeded in approximating the factorized posterior $p(z_1, z_2, \ldots, z_K \mid x)$ within the VAE framework, achieving impressive unsupervised object-level scene factorization. However, being single-view models, they fall victim to single-view spatial ambiguities.…”