In Video Instance Segmentation (VIS), current approaches either focus on the quality of the results, taking the whole video as input and processing it offline, or on speed, handling the video frame by frame at the cost of competitive performance. In this work, we propose an online method whose performance is on par with that of its offline counterparts. We introduce a message-passing graph neural network that encodes objects and relates them through time. We additionally propose a novel module that fuses features from the feature pyramid network through residual connections. Our model, trained end-to-end, achieves state-of-the-art performance among online methods on the YouTube-VIS dataset. Further experiments on DAVIS demonstrate that our model generalizes to the video object segmentation task. Code is available at: https://github.com/caganselim/TLTM
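As a rough illustration of the kind of temporal message passing referred to above, the sketch below propagates information between per-object embeddings of two consecutive frames. All names (TemporalMessagePassing, the dense cross-frame graph, mean aggregation) are hypothetical choices made for this example and do not reflect the authors' actual implementation.

```python
import torch
import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    """One round of message passing between object embeddings of two frames.

    Hypothetical sketch: each object in frame t exchanges messages with every
    object in frame t+1 (a dense bipartite graph), and node states are updated
    with a residual MLP. Illustrative only, not the paper's exact module.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Edge network: maps a pair of node embeddings to a message vector.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Node network: fuses a node with its aggregated incoming messages.
        self.node_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, prev: torch.Tensor, curr: torch.Tensor):
        # prev: (N, dim) object embeddings at frame t; curr: (M, dim) at t+1.
        n, m = prev.size(0), curr.size(0)
        # Build all N*M cross-frame pairs and compute one message per edge.
        pairs = torch.cat([prev.unsqueeze(1).expand(n, m, -1),
                           curr.unsqueeze(0).expand(n, m, -1)], dim=-1)
        msgs = self.edge_mlp(pairs)  # (N, M, dim)
        # Aggregate messages by mean and update each node with a residual step.
        prev_upd = prev + self.node_mlp(torch.cat([prev, msgs.mean(dim=1)], dim=-1))
        curr_upd = curr + self.node_mlp(torch.cat([curr, msgs.mean(dim=0)], dim=-1))
        return prev_upd, curr_upd
```

After such an update, a matching head could score pairs of refined embeddings to link instances across frames.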
Introduction

Video Instance Segmentation (VIS) is the task of concurrently detecting, segmenting, and tracking object instances in videos. Recent progress in VIS is mainly driven by large datasets [4,9,27,39,40] that allow these tasks to be solved jointly. Existing methods can be categorized as offline, when they take the whole video clip as input, or online, when they process frames or pairs of frames sequentially. Online methods typically follow the tracking-by-segmentation paradigm: they first perform instance segmentation on each frame and then associate instances through time via a matching algorithm [10,18,22,32,37,41]. While frame-by-frame processing is fast, it lacks temporal context, which results in a large number of ID switches due to, e.g., occlusions. In contrast, offline methods can better leverage the spatio-temporal information from all the frames in the video [1,2,19,35]. While this temporal information yields stronger performance, it hurts efficiency compared to online counterparts, making offline methods less suitable for real-time applications.
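To make the tracking-by-segmentation paradigm concrete, the sketch below associates instance masks across consecutive frames using Hungarian matching on mask IoU. This is a generic baseline association step written for illustration; the function names and the IoU threshold are assumptions, not the algorithm of any specific cited method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of the same spatial size."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def associate(prev_masks, curr_masks, iou_thresh=0.3):
    """Match current-frame masks to previous-frame masks (generic baseline).

    Returns a list of (prev_idx, curr_idx) pairs; unmatched current masks
    would start new tracks. Hypothetical example, not any paper's exact rule.
    """
    if not prev_masks or not curr_masks:
        return []
    cost = np.zeros((len(prev_masks), len(curr_masks)))
    for i, pm in enumerate(prev_masks):
        for j, cm in enumerate(curr_masks):
            cost[i, j] = -mask_iou(pm, cm)  # negate: Hungarian minimizes cost
    rows, cols = linear_sum_assignment(cost)
    # Keep only confident matches; low-IoU pairs are treated as new objects.
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]
```

Purely appearance-free matching of this kind is exactly where occlusions cause ID switches, which motivates incorporating learned temporal context as proposed in this work.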