This paper examines the problem of dynamic traffic scene classification under space-time variations in viewpoint that arise from video captured on board a moving vehicle. Solutions to this problem are important for the realization of effective driving-assistance technologies that must interpret or predict road user behavior. Dynamic traffic scene classification has so far not been adequately addressed, owing to a lack of benchmark datasets that consider the spatiotemporal evolution of traffic scenes resulting from a vehicle's ego-motion. This paper makes three main contributions. First, we release an annotated dataset for dynamic scene classification comprising 80 hours of diverse, high-quality driving video clips collected in the San Francisco Bay Area. The dataset includes temporal annotations for road places, road types, weather, and road surface conditions. Second, we introduce novel and baseline algorithms that exploit the semantic context and temporal structure of the dataset for dynamic classification of road scenes. Finally, we present algorithms and experimental results showing that features extracted by scene classification serve as strong priors and aid tactical driver-behavior understanding. The results show significant improvement over previously reported driving-behavior detection baselines in the literature.
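To make the annotation scheme concrete, the sketch below shows one plausible way to represent and query temporal annotations of this kind. The record fields, label strings, and helper function are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record for a temporal annotation over a driving-video clip.
# Field names and label vocabularies are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class TemporalAnnotation:
    clip_id: str       # which video clip the annotation belongs to
    attribute: str     # e.g. "road_place", "road_type", "weather", "road_surface"
    label: str         # e.g. "intersection", "highway", "rainy", "wet"
    start_frame: int   # first frame the label applies to (inclusive)
    end_frame: int     # last frame the label applies to (inclusive)

def labels_at(annotations, clip_id, frame):
    """Collect the labels active for one frame of one clip."""
    return {
        a.attribute: a.label
        for a in annotations
        if a.clip_id == clip_id and a.start_frame <= frame <= a.end_frame
    }
```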
Predictive models for forecasting the future behavior of road agents should consider the multi-modal nature of agent behavior and be aware of the uncertainty of their predictions. This is particularly true from the egocentric view, where the motion of other agents is captured relative to the ego-motion: the uncertainty of ego-motion prediction is critical for determining agents' interactive reactions and behaviors. Along this line, we propose NEMO (Noisy Ego MOtion priors for future object localization) for future forecasting of road agents in the egocentric view. A predictive distribution over future locations is modeled jointly with the uncertainty of the predictions. To this end, we divide the problem into two tasks: future ego-motion prediction and future object localization. We first model the multi-modal distribution of future ego-motion with uncertainty estimates. The resulting distribution of ego-behavior is used to sample multiple modes of future ego-motion. Each modality is then used as a prior to understand the interactions between the ego-vehicle and the target agent. We predict the multi-modal future locations of the target from individual modes of the ego-vehicle, modeling the uncertainty of the target's behavior. Finally, we extensively evaluate the proposed framework on the publicly available HEV-I benchmark dataset, augmented with Inertial Measurement Unit (IMU) data.
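The two-stage scheme described above can be sketched minimally in PyTorch: an ego-motion head predicts a distribution over future ego-motion, several modes are sampled from it, and each sampled mode conditions a target-localization decoder. The module names, feature dimensions, and the single-Gaussian simplification are assumptions for illustration; the paper models a richer multi-modal distribution.

```python
# Minimal sketch, assuming a unimodal Gaussian over future ego-motion.
import torch
import torch.nn as nn

class EgoMotionHead(nn.Module):
    """Predict a mean and log-variance over future ego-motion."""
    def __init__(self, in_dim=64, horizon=10, motion_dim=3):
        super().__init__()
        self.out = nn.Linear(in_dim, 2 * horizon * motion_dim)

    def forward(self, past_ego_feat):
        mu, log_var = self.out(past_ego_feat).chunk(2, dim=-1)
        return mu, log_var  # each: (B, horizon * motion_dim)

class TargetLocalizer(nn.Module):
    """Predict the target's future boxes given one ego-motion mode."""
    def __init__(self, feat_dim=64, horizon=10, motion_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + horizon * motion_dim, 128),
            nn.ReLU(),
            nn.Linear(128, horizon * 4),  # 4 = future box (x, y, w, h)
        )

    def forward(self, target_feat, ego_mode):
        return self.net(torch.cat([target_feat, ego_mode], dim=-1))

def forecast(ego_head, localizer, past_ego_feat, target_feat, k_modes=5):
    """Sample k ego-motion modes; condition localization on each one."""
    mu, log_var = ego_head(past_ego_feat)
    std = (0.5 * log_var).exp()
    futures = []
    for _ in range(k_modes):
        ego_mode = mu + std * torch.randn_like(std)  # reparameterized sample
        futures.append(localizer(target_feat, ego_mode))
    return torch.stack(futures, dim=1)  # (B, k_modes, horizon * 4)
```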
The ability to semantically interpret hand-drawn line sketches, although very challenging, can pave the way for novel applications in multimedia. We propose SketchParse, the first deep-network architecture for fully automatic parsing of freehand object sketches. SketchParse is configured as a two-level fully convolutional network. The first level contains shared layers common to all object categories. The second level contains a number of expert sub-networks, each specializing in parsing sketches from object categories whose parts are structurally similar. Effectively, the two-level configuration enables our architecture to scale up efficiently as additional categories are added. We introduce a router layer which (i) relays sketch features from the shared layers to the correct expert and (ii) eliminates the need to manually specify the object category during inference. To bypass laborious part-level annotation, we sketchify photos from semantic object-part image datasets and use them for training. Our architecture also incorporates object pose prediction as a novel auxiliary task, which boosts overall performance while providing supplementary information about the sketch. We demonstrate SketchParse's abilities (i) on two challenging large-scale sketch datasets, (ii) in parsing unseen, semantically related object categories, and (iii) in improving fine-grained sketch-based image retrieval. As a novel application, we also outline how SketchParse's output can be used to generate caption-style descriptions for hand-drawn sketches.
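A minimal sketch of the two-level configuration follows: shared convolutional layers feed a router that dispatches each sketch to one of several expert sub-networks, so the category need not be specified at inference time. The layer widths and the argmax routing rule are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of a two-level shared/expert network with a router layer.
import torch
import torch.nn as nn

class TwoLevelParser(nn.Module):
    def __init__(self, n_experts=5, n_parts=8):
        super().__init__()
        # Level 1: layers shared across all object categories.
        self.shared = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Router: predicts which expert should parse the sketch.
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_experts)
        )
        # Level 2: one expert per group of structurally similar categories,
        # each producing per-pixel part logits.
        self.experts = nn.ModuleList(
            nn.Conv2d(64, n_parts, 1) for _ in range(n_experts)
        )

    def forward(self, sketch):
        feats = self.shared(sketch)
        expert_idx = self.router(feats).argmax(dim=-1)
        # Route each sample in the batch to its predicted expert.
        out = [
            self.experts[int(i)](f.unsqueeze(0))
            for f, i in zip(feats, expert_idx)
        ]
        return torch.cat(out, dim=0), expert_idx
```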