“…Video object segmentation is developing rapidly within computer vision. Most solutions are fundamentally supervised, as they rely on models heavily pretrained on human-labeled annotations [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]. Although manual annotation is extremely costly, there are very few genuinely unsupervised methods [17], [18], [19], [20], [21].…”