In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called MaskTrack R-CNN for this task. Our new method adds a tracking branch to Mask R-CNN so that detection, segmentation and tracking are performed jointly. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insights for future improvement. We believe the video instance segmentation task will motivate the community along this line of research for video understanding.
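To make the tracking-branch idea concrete, here is a minimal PyTorch-style sketch (our own illustration, not the paper's released code): each detected instance is embedded, then softmax-matched against a memory of previously seen identities, with an extra slot for declaring a new instance. All layer sizes and names below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackingBranch(nn.Module):
    """Hypothetical tracking head: embeds per-instance RoI features and
    scores each current detection against identities from earlier frames.
    Layer sizes are illustrative placeholders, not the paper's settings."""
    def __init__(self, in_dim=256 * 7 * 7, embed_dim=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, roi_feats, memory):
        # roi_feats: (N, in_dim) flattened RoI features of current detections
        # memory:    (M, embed_dim) embeddings of instances tracked so far
        cur = self.fc(roi_feats)                 # (N, embed_dim)
        sim = cur @ memory.t()                   # (N, M) matching logits
        new_obj = cur.new_zeros(cur.size(0), 1)  # logit for "new instance"
        return F.softmax(torch.cat([new_obj, sim], dim=1), dim=1)

branch = TrackingBranch()
probs = branch(torch.randn(3, 256 * 7 * 7), torch.randn(5, 128))
print(probs.shape)  # torch.Size([3, 6]): 5 known identities + "new"
```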
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods that capture temporal dependency for segmentation have to depend on pretrained optical flow models, leading to suboptimal solutions. End-to-end sequential learning of spatial-temporal features for video segmentation is largely limited by the scale of available video segmentation datasets; even the largest existing dataset contains only 90 short video clips. To address this problem, we build a new large-scale video object segmentation dataset, the YouTube Video Object Segmentation dataset (YouTube-VOS). Our dataset contains 4,453 YouTube video clips and 94 object categories. To our knowledge, this is by far the largest video object segmentation dataset, and it has been released at http://youtube-vos.org. We further evaluate several existing state-of-the-art video object segmentation algorithms on this dataset, aiming to establish baselines for the development of new algorithms in the future.
Introduction
The nuclear pore complex (NPC) is the primary gateway for transport of macromolecules between the nucleus and cytoplasm, serving as both a critical mediator and regulator of gene expression. NPCs are enormous ~120 MDa macromolecular machines embedded in the nuclear envelope, each containing ~1000 protein subunits, termed nucleoporins. Despite substantial progress in visualizing the overall shape of the NPC by cryoelectron tomography and in determining atomic-resolution crystal structures of nucleoporins, the molecular architecture of the assembled NPC remains poorly understood, hindering the design of mechanistic studies that could investigate its many roles in cell biology.

Rationale
Existing cryoelectron tomographic reconstructions of the NPC remain too low in resolution to allow for de novo structure determination of the NPC or unbiased docking of nucleoporin fragment crystal structures. We sought to bridge this resolution gap by first defining the interaction network of the NPC, focusing on the evolutionarily conserved symmetric core. We developed protocols to reconstitute NPC protomers from purified, recombinant proteins, which enabled the generation of a high-resolution biochemical interaction map of the NPC symmetric core. We next determined high-resolution crystal structures of key nucleoporin interactions, providing spatial restraints for their relative orientation. Lastly, by superposing crystal structures that overlapped in sequence, we generated accurate full-length structures of the large scaffold nucleoporins. Supported by these biochemical data, we used sequential, unbiased searches to place the nucleoporin crystal structures into a previously determined cryoelectron tomographic reconstruction of the intact human NPC, thus generating a composite structure of the entire NPC symmetric core.

Results
Our analysis revealed that the inner and outer rings of the NPC utilize disparate mechanisms of interaction. Whereas the structured coat nucleoporins of the outer ring form extensive surface contacts, the scaffold proteins of the inner ring are bridged by flexible sequences in linker nucleoporins. Our composite structure revealed a defined spoke architecture with limited cross-spoke interactions. Most nucleoporins are present in 32 copies, with the notable exceptions of Nup170 and Nup188. Lastly, we observed the arrangement of the channel nucleoporins, which orient their N-termini into two sixteen-membered rings, ensuring that their N-terminal FG repeats project evenly into the central transport channel.

Conclusion
Our composite structure of the NPC symmetric core can be used as a platform for the rational design of experiments to probe NPC structure and function. Each nucleoporin occupies multiple distinct biochemical environments, explaining how such a large macromolecular complex can be assembled from a relatively small number of unique genes. Our integrated, bottom-up approach provides a paradigm for the biochemical and structural characterization of similarly large biological mega-assemblies.
Event cameras sense intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors that utilize events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address the novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution of events. This is hampered by the difficulty of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that incorporates the spatial-temporal interpolation of events into VSR in a unified framework. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and from features of both RGB frames and events. Our method contains three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame at arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses the prior arts and achieves VSR at random scales, e.g., 6.5. Code and dataset are available at https://vlis2022.github.io/cvpr23/egvsr.
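To illustrate the implicit-representation idea, here is a minimal sketch under our own assumptions (module name and sizes are hypothetical, and the STF/TF feature extraction is replaced by random features): a plain MLP decoder queried at continuous (x, y, t) coordinates, so frames can be decoded at any resolution and timestamp, point by point.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Hypothetical decoder: maps a queried (x, y, t) coordinate plus a
    feature vector sampled at that location to an RGB value."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, coords, feats):
        # coords: (N, 3) normalized (x, y, t) query points
        # feats:  (N, feat_dim) features sampled at those points
        return self.mlp(torch.cat([coords, feats], dim=-1))

# Decode an arbitrary 45x80 grid at an arbitrary timestamp t = 0.37
model = ImplicitDecoder()
H, W = 45, 80
ys, xs = torch.meshgrid(
    torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
)
coords = torch.stack(
    [xs.flatten(), ys.flatten(), torch.full((H * W,), 0.37)], dim=-1
)
rgb = model(coords, torch.randn(H * W, 64)).view(H, W, 3)
print(rgb.shape)  # torch.Size([45, 80, 3])
```

Because the decoder is evaluated per coordinate, the output grid density (the scale factor) and the query timestamp are free parameters at inference time, which is what makes arbitrary-scale VSR possible in this formulation.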
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent, but usually misaligned, video frames. Existing deep methods generally tackle this with either a sliding-window strategy or a recurrent architecture, which respectively are restricted to frame-by-frame restoration or lack long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted at every other layer. In addition, parallel warping further fuses information from neighboring frames by parallel feature warping. Experimental results on three tasks (video super-resolution, video deblurring and video denoising) demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on nine benchmark datasets.
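As a rough sketch of the mutual-attention half of TMSA (our own simplification with placeholder dimensions, not VRT's implementation), the layer below lets queries from one frame attend to keys and values from a neighboring frame, so alignment and fusion happen in a single attention step; the self-attention path and the cross-clip shifting are omitted.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Hypothetical mutual attention between two frames' feature tokens:
    each frame queries its neighbor, jointly aligning and fusing features."""
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, N, dim) token sequences from neighboring frames
        fused_a, _ = self.attn(feat_a, feat_b, feat_b)  # A queries B
        fused_b, _ = self.attn(feat_b, feat_a, feat_a)  # B queries A
        return fused_a, fused_b

layer = MutualAttention()
a, b = layer(torch.randn(1, 64, 96), torch.randn(1, 64, 96))
print(a.shape, b.shape)  # torch.Size([1, 64, 96]) each
```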