Marvin Zhang scite author profile

When to Trust Your Model: Model-Based Policy Optimization

Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.

Preprint. Under review.

Deep reinforcement learning for tensegrity robot locomotion

Zhang, Geng, Bruce, et al. 2017

Tensegrity robots, composed of rigid rods connected by elastic cables, have a number of unique properties that make them appealing for use as planetary exploration rovers. However, control of tensegrity robots remains a difficult problem due to their unusual structures and complex dynamics. In this work, we show how locomotion gaits can be learned automatically using a novel extension of mirror descent guided policy search (MDGPS) applied to periodic locomotion movements, and we demonstrate the effectiveness of our approach on tensegrity robot locomotion. We evaluate our method with real-world and simulated experiments on the SUPERball tensegrity robot, showing that the learned policies generalize to changes in system parameters, unreliable sensor measurements, and variation in environmental conditions, including varied terrains and a range of different gravities. Our experiments demonstrate not only that our method learns fast, power-efficient feedback policies for rolling gaits, but also that these policies can succeed with only the limited onboard sensing provided by SUPERball's accelerometers. We compare the learned feedback policies to learned open-loop policies and hand-engineered controllers, and demonstrate that the learned policy enables the first continuous, reliable locomotion gait for the real SUPERball robot. Our code and other supplementary materials are available from http://rll.berkeley.edu/drl_tensegrity

Learning deep neural network policies with continuous memory states

Zhang, McCarthy, Finn, et al. 2016

Policy learning for partially observed control tasks requires policies that can remember salient information from past observations. In this paper, we present a method for learning policies with internal memory for high-dimensional, continuous systems, such as robotic manipulators. Our approach consists of augmenting the state and action space of the system with continuous-valued memory states that the policy can read from and write to. Learning general-purpose policies with this type of memory representation directly is difficult, because the policy must automatically figure out the most salient information to memorize at each time step. We show that, by decomposing this policy search problem into a trajectory optimization phase and a supervised learning phase through a method called guided policy search, we can acquire policies with effective memorization and recall strategies. Intuitively, the trajectory optimization phase chooses the values of the memory states that will make it easier for the policy to produce the right action in future states, while the supervised learning phase encourages the policy to use memorization actions to produce those memory states. We evaluate our method on tasks involving continuous control in manipulation and navigation settings, and show that our method can learn complex policies that successfully complete a range of tasks that require memory.

AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

Smith, Dhawan, Zhang, et al. 2020

AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos

Smith, Dhawan, Zhang, et al. 2019 (Preprint)

Fig. 1: Schematic of the overall method. Left: Human instructions for each stage (top) are translated at the pixel level into robot instructions (bottom) via the CycleGAN. Note the artifacts in the generated translations, e.g., the displaced robot gripper in the bottom right image. Right: The robot attempts the task stage-wise, automatically resetting and retrying until the instruction classifier signals success, which prompts the human to confirm via a key press.
