In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called MaskTrack R-CNN for this task. Our new method adds a tracking branch to Mask R-CNN so that detection, segmentation and tracking are performed jointly. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insights for future improvement. We believe the video instance segmentation task will motivate the community along this line of research for video understanding.
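To make the tracking-branch idea concrete, here is a minimal PyTorch-style sketch (our own illustration, not the paper's released code): each detected instance is embedded, then softmax-matched against a memory of previously seen identities, with an extra slot for declaring a new instance. All layer sizes and names below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackingBranch(nn.Module):
    """Hypothetical tracking head: embeds per-instance RoI features and
    scores each current detection against identities from earlier frames.
    Layer sizes are illustrative placeholders, not the paper's settings."""
    def __init__(self, in_dim=256 * 7 * 7, embed_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, roi_feats, memory):
        # roi_feats: (N, in_dim) flattened RoI features of current detections
        # memory:    (M, embed_dim) embeddings of instances tracked so far
        cur = self.fc(roi_feats)                 # (N, embed_dim)
        sim = cur @ memory.t()                   # (N, M) matching logits
        new_obj = cur.new_zeros(cur.size(0), 1)  # logit for "new instance"
        return F.softmax(torch.cat([new_obj, sim], dim=1), dim=1)

branch = TrackingBranch()
probs = branch(torch.randn(3, 256 * 7 * 7), torch.randn(5, 128))
print(probs.shape)  # torch.Size([3, 6]): 5 known identities + "new"
```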
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions. End-to-end sequential learning of spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets; even the largest existing dataset contains only 90 short video clips. To address this problem, we build a new large-scale video object segmentation dataset, the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. To our knowledge, this is by far the largest video object segmentation dataset, and it has been released at http://youtube-vos.org. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset, aiming to establish baselines for the development of new algorithms in the future.
Introduction
The nuclear pore complex (NPC) is the primary gateway for transport of macromolecules between the nucleus and cytoplasm, serving as both a critical mediator and regulator of gene expression. NPCs are enormous ~120 MDa macromolecular machines embedded in the nuclear envelope, each containing ~1000 protein subunits, termed nucleoporins. Despite substantial progress in visualizing the overall shape of the NPC by cryoelectron tomography and in determining atomic-resolution crystal structures of nucleoporins, the molecular architecture of the assembled NPC remains poorly understood, hindering the design of mechanistic studies that could investigate its many roles in cell biology.

Rationale
Existing cryoelectron tomographic reconstructions of the NPC remain too low in resolution to allow for de novo structure determination of the NPC or unbiased docking of nucleoporin fragment crystal structures. We sought to bridge this resolution gap by first defining the interaction network of the NPC, focusing on the evolutionarily conserved symmetric core. We developed protocols to reconstitute NPC protomers from purified, recombinant proteins, which enabled the generation of a high-resolution biochemical interaction map of the NPC symmetric core. We next determined high-resolution crystal structures of key nucleoporin interactions, providing spatial restraints for their relative orientation. Lastly, by superposing crystal structures that overlapped in sequence, we generated accurate full-length structures of the large scaffold nucleoporins. Supported by these biochemical data, we used sequential, unbiased searches to place the nucleoporin crystal structures into a previously determined cryoelectron tomographic reconstruction of the intact human NPC, thus generating a composite structure of the entire NPC symmetric core.

Results
Our analysis revealed that the inner and outer rings of the NPC utilize disparate mechanisms of interaction. Whereas the structured coat nucleoporins of the outer ring form extensive surface contacts, the scaffold proteins of the inner ring are bridged by flexible sequences in linker nucleoporins. Our composite structure revealed a defined spoke architecture with limited cross-spoke interactions. Most nucleoporins are present in 32 copies, with the notable exceptions of Nup170 and Nup188. Lastly, we observed the arrangement of the channel nucleoporins, which orient their N-termini into two sixteen-membered rings, ensuring that their N-terminal FG repeats project evenly into the central transport channel.

Conclusion
Our composite structure of the NPC symmetric core can be used as a platform for the rational design of experiments to probe NPC structure and function. Each nucleoporin occupies multiple distinct biochemical environments, explaining how such a large macromolecular complex can be assembled from a relatively small number of unique genes. Our integrated, bottom-up approach provides a paradigm for the biochemical and structural characterization of similarly large biological mega-assemblies.
Event cameras sense intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors that utilize events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address the novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution of events. This is hampered by the difficulty of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that incorporates the spatial-temporal interpolation of events into VSR in a unified framework. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and from features of both RGB frames and events. Our method contains three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame at arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses the prior arts and achieves VSR at random scales, e.g., 6.5. Code and dataset are available at https://vlis2022.github.io/cvpr23/egvsr.
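To illustrate the implicit-representation idea, here is a minimal sketch under our own assumptions (module name and sizes are hypothetical, and the STF/TF feature extraction is replaced by random features): a plain MLP decoder queried at continuous (x, y, t) coordinates, so frames can be decoded at any resolution and timestamp, point by point.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Hypothetical decoder: maps a queried (x, y, t) coordinate plus a
    feature vector sampled at that location to an RGB value."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, coords, feats):
        # coords: (N, 3) normalized (x, y, t) query points
        # feats:  (N, feat_dim) features sampled at those points
        return self.mlp(torch.cat([coords, feats], dim=-1))

# Decode an arbitrary 45x80 grid at an arbitrary timestamp t = 0.37
model = ImplicitDecoder()
H, W = 45, 80
ys, xs = torch.meshgrid(
    torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
)
coords = torch.stack(
    [xs.flatten(), ys.flatten(), torch.full((H * W,), 0.37)], dim=-1
)
rgb = model(coords, torch.randn(H * W, 64)).view(H, W, 3)
print(rgb.shape)  # torch.Size([45, 80, 3])
```

Because the decoder is evaluated per coordinate, the output grid density (the scale factor) and the query timestamp are free parameters at inference time, which is what makes arbitrary-scale VSR possible in this formulation.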
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent, but usually misaligned, video frames. Existing deep methods generally tackle this with either a sliding-window strategy or a recurrent architecture, which respectively are restricted to frame-by-frame restoration or lack long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted at every other layer. In addition, parallel warping further fuses information from neighboring frames by parallel feature warping. Experimental results on three tasks (video super-resolution, video deblurring and video denoising) demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on nine benchmark datasets.
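As a rough sketch of the mutual-attention half of TMSA (our own simplification with placeholder dimensions, not VRT's implementation), the layer below lets queries from one frame attend to keys and values from a neighboring frame, so alignment and fusion happen in a single attention step; the self-attention path and the cross-clip shifting are omitted.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Hypothetical mutual attention between two frames' feature tokens:
    each frame queries its neighbor, jointly aligning and fusing features."""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, N, dim) token sequences from neighboring frames
        fused_a, _ = self.attn(feat_a, feat_b, feat_b)  # A queries B
        fused_b, _ = self.attn(feat_b, feat_a, feat_a)  # B queries A
        return fused_a, fused_b

layer = MutualAttention()
a, b = layer(torch.randn(1, 64, 96), torch.randn(1, 64, 96))
print(a.shape, b.shape)  # torch.Size([1, 64, 96]) each
```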