“…The Space-Time Memory Network [35] memorizes intermediate frames with segmentation masks as references and performs pixellevel matching between them with the current frame to segment target objects in a bottom-up manner, which has been proved effective and has served as the current mainstream framework. Some works [40,23,5,15,59,41,51,6,62,46,25,27] further develop STM and have achieved excellent performance.…”