This paper introduces a video dataset of spatiotemporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) the use of movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon current state-of-the-art methods and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for new approaches to video understanding.
Abstract. Spectral domain optical coherence tomography (SD-OCT) is an important tool for the diagnosis of various retinal diseases. The measurements available from SD-OCT volumes can be used to detect structural changes in glaucoma patients before the resulting vision loss becomes noticeable. Eye movement during the imaging process corrupts the data, making measurements unreliable. We propose a method to correct for transverse motion artifacts in SD-OCT volumes after scan acquisition by registering the volume to an instantaneous, and therefore artifact-free, reference image. Our procedure corrects for smooth deformations resulting from ocular tremor and drift, as well as the abrupt vessel discontinuities resulting from microsaccades. We evaluate our method on 48 scans of healthy eyes and 116 scans of glaucomatous eyes, improving scan quality in 96% of healthy and 73% of glaucomatous eyes.
We couple occlusion modeling and multi-frame motion estimation to compute dense, temporally extended point trajectories in video with significant occlusions. Our approach combines robust spatial regularization with spatially and temporally global occlusion labeling in a variational, Lagrangian framework with subspace constraints. We track points even through ephemeral occlusions. Experiments demonstrate accuracy superior to the state of the art while tracking more points through more frames.
Abstract. We propose a new principle for recognizing fingerspelling sequences from American Sign Language (ASL). Instead of training a system to recognize the static posture for each letter from an isolated frame, we recognize the dynamic gestures corresponding to transitions between letters. This eliminates the need for an explicit temporal segmentation step, which we show is error-prone at the speeds used by native signers. We present results from our system recognizing 82 different words signed by a single signer, using more than an hour of training and test video. We demonstrate that recognizing letter-to-letter transitions without temporal segmentation is feasible and improves recognition performance.