Huazhe Xu scite author profile

End-to-End Learning of Driving Models from Large-Scale Video Datasets

Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have generally been limited to in-situ models learned from a single vehicle or simulation environment. We advocate learning a generic vehicle motion model from large-scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm. We provide a novel large-scale dataset of crowd-sourced driving behavior suitable for training our model, and report results predicting the driver action on held-out sequences across diverse conditions.

Natural Language Object Retrieval

Rohrbach

et al. 2016

In this paper, we address the task of natural language object retrieval: localizing a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from the text-based image retrieval task in that it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as a score for the box, and can transfer visual-linguistic knowledge from the image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large-scale vision and language datasets for knowledge transfer.

Disentangling Propagation and Generation for Video Prediction

Gao

Cai³

et al. 2019

A dynamic scene has two types of elements: those that move fluidly and can be predicted from previous frames, and those which are disoccluded (exposed) and cannot be extrapolated. Prior approaches to video prediction typically learn either to warp or to hallucinate future pixels, but not both. In this paper, we describe a computational model for high-fidelity video prediction which disentangles motion-specific propagation from motion-agnostic generation. We introduce a confidence-aware warping operator which gates the output of pixel predictions from a flow predictor for non-occluded regions and from a context encoder for occluded regions. Moreover, in contrast to prior works where confidence is jointly learned with flow and appearance using a single network, we compute confidence after a warping step, and employ a separate network to inpaint exposed regions. Empirical results on both synthetic and real datasets show that our disentangling approach provides better occlusion maps and produces both sharper and more realistic predictions compared to strong baselines.
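The gating idea in the abstract above can be illustrated with a minimal sketch. It assumes a per-pixel confidence in [0, 1] that blends a warped (propagated) prediction with a generated (inpainted) one; the function name `gate_pixels` and the flat-list pixel representation are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def gate_pixels(warped, generated, confidence):
    """Blend two per-pixel predictions with a confidence gate.

    Assumption (not from the paper): pixels are flat lists of floats and
    confidence[i] in [0, 1] is the trust placed in the warped pixel i.
    """
    if not (len(warped) == len(generated) == len(confidence)):
        raise ValueError("all inputs must have the same length")
    # High confidence keeps the warped pixel; low confidence falls back
    # to the generated (inpainted) pixel.
    return [c * w + (1.0 - c) * g
            for w, g, c in zip(warped, generated, confidence)]


# A fully trusted pixel keeps the warped value, an untrusted one is replaced.
blended = gate_pixels([1.0, 1.0], [0.0, 0.0], [1.0, 0.0])
```

In this toy case the first pixel comes entirely from the warped input and the second entirely from the generated one.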

End-to-end Learning of Driving Models from Large-scale Video Datasets

Xu¹,

Gao²,

Yu³

et al. 2016

Preprint

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

Wang

et al. 2021

Multi-Task Reinforcement Learning with Soft Modularization

Yang¹,

Xu²,

Wu³

et al. 2020

Preprint

Multi-task learning is a very challenging problem in reinforcement learning. While training multiple tasks jointly allows the policies to share parameters across different tasks, the optimization problem becomes non-trivial: it is unclear which parameters in the network should be reused across tasks, and the gradients from different tasks may interfere with each other. Thus, instead of naively sharing parameters across tasks, we introduce an explicit modularization technique on policy representation to alleviate this optimization issue. Given a base policy network, we design a routing network which estimates different routing strategies to reconfigure the base network for each task. Instead of creating a concrete route for each task, our task-specific policy is represented by a soft combination of all possible routes. We name this approach soft modularization. We experiment with multiple robotics manipulation tasks in simulation and show our method improves sample efficiency and performance over baselines by a large margin. Our project page is at: https://rchalyang.github.io/SoftModule.
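A "soft combination of all possible routes" can be pictured with a small sketch. It assumes the routing network emits one logit per module and that module outputs are mixed with softmax weights; the names `softmax` and `soft_route`, and the single-layer setup, are hypothetical simplifications rather than the authors' architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(module_outputs, routing_logits):
    """Mix module outputs with softmax routing weights.

    Assumption (illustrative only): each module output is a list of floats
    of equal length, and routing_logits has one entry per module.
    """
    weights = softmax(routing_logits)
    dim = len(module_outputs[0])
    # Weighted sum over modules, computed per output dimension.
    return [sum(w * out[i] for w, out in zip(weights, module_outputs))
            for i in range(dim)]


# Equal logits give each of the two modules a weight of 0.5.
mixed = soft_route([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A hard route would pick a single module per task; the soft version keeps every module in play and lets the weights differ by task.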

Natural Language Object Retrieval

Rohrbach

et al. 2015

Preprint

BeBold: Exploration Beyond the Boundary of Explored Regions

Zhang¹,

Xu²,

Wang³

et al. 2020

Preprint

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. To guide exploration, previous work makes extensive use of intrinsic reward (IR). There are many heuristics for IR, including visitation counts, curiosity, and state-difference. In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR. The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment. The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning. In comparison, previous SoTA only solves 50% of the tasks. BeBold also achieves SoTA on multiple tasks in NetHack, a popular rogue-like game that contains more challenging procedurally-generated environments.
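The "regulated difference of inverse visitation counts" can be sketched as follows. This is a minimal illustration that assumes "regulated" means clipping the difference at zero and that an unseen current state counts as visited once; the function name `inverse_count_reward` and the tabular `Counter` are hypothetical and not part of the abstract.

```python
from collections import Counter


def inverse_count_reward(counts, state, next_state):
    """Clipped difference of inverse visitation counts (illustrative).

    Assumptions, not taken from the abstract: the difference is clipped
    at zero, and an unseen current state is treated as visited once.
    """
    counts[next_state] += 1  # record the visit to the new state
    inv_next = 1.0 / counts[next_state]
    inv_curr = 1.0 / max(counts[state], 1)
    # The reward is positive only when moving to a less-visited state.
    return max(inv_next - inv_curr, 0.0)


counts = Counter({"a": 4})
outward = inverse_count_reward(counts, "a", "b")  # well-visited to new state
inward = inverse_count_reward(counts, "b", "a")   # back to the familiar state
```

Under these assumptions the step from the well-visited state to the new one earns a positive reward, while the step back earns none, which is the behavior the abstract describes as pushing the agent past the boundary of explored regions.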
