Gandharv Patil scite author profile

Gandharv Patil

4 Publications

8 Citation Statements Received

159 Citation Statements Given


Affiliations

Centre Universitaire de Mila, McGill University

Publications


Variance Penalized On-Policy and Off-Policy Actor-Critic

Jain, Patil, Jain et al. 2021

AAAI


Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
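As a rough illustration of the criterion this abstract describes, a mean-variance objective and a TD-style direct variance update can be written as below. This is a sketch under stated assumptions, not the paper's exact formulation: the trade-off weight \psi, the variance estimate \sigma, the step sizes \alpha and \bar{\alpha}, and the choice of squared TD error as the variance "reward" with discount \gamma^2 are assumed here for illustration.

```latex
% Variance-penalized objective (assumed form): mean return minus a weighted variance.
J(\theta) = \mathbb{E}_{\pi_\theta}[G] - \psi \, \mathrm{Var}_{\pi_\theta}[G], \qquad \psi \ge 0

% TD error and value update for the mean return.
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad
V(S_t) \leftarrow V(S_t) + \alpha \, \delta_t

% Direct variance estimate, updated incrementally with its own TD rule
% (assumption: squared TD error as the signal, discount \gamma^2).
\sigma(S_t) \leftarrow \sigma(S_t) + \bar{\alpha} \left( \delta_t^2 + \gamma^2 \sigma(S_{t+1}) - \sigma(S_t) \right)
```

Setting \psi = 0 recovers the usual expected-return objective, so \psi controls how strongly variance in the return is penalized.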


Variance Penalized On-Policy and Off-Policy Actor-Critic

Jain¹, Patil², Jain³ et al. 2021

Preprint


On learning history based policies for controlling Markov decision processes

Patil¹, Mahajan², Precup³ 2022

Preprint


Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.


Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation

Patil¹, Prashanth², Nagaraj³ et al. 2022

Preprint


We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal O(1/t) rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
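To make the idea of tail averaging concrete, here is a minimal sketch of TD(0) with linear function approximation that returns the average of the iterates from a chosen index onwards. It is an illustration under assumptions, not the paper's algorithm: the function name `tail_averaged_td0`, its parameters, the constant step size, and the simple ridge-style shrinkage controlled by `reg` are all hypothetical choices made for this example.

```python
import numpy as np


def tail_averaged_td0(features, rewards, next_features, gamma,
                      step_size, tail_start, reg=0.0):
    """Run TD(0) over a fixed trajectory and average the tail iterates.

    Hypothetical helper for illustration. `features[t]` and
    `next_features[t]` are the feature vectors of the current and next
    state at step t; `reg` adds a ridge-style shrinkage term (assumed form).
    """
    n, d = features.shape
    if not 0 <= tail_start < n:
        raise ValueError("tail_start must satisfy 0 <= tail_start < n")
    theta = np.zeros(d)
    tail_sum = np.zeros(d)
    for t in range(n):
        phi, phi_next = features[t], next_features[t]
        # One-step TD error under the current linear value estimate.
        td_error = rewards[t] + gamma * (phi_next @ theta) - phi @ theta
        theta = theta + step_size * (td_error * phi - reg * theta)
        # Only iterates from tail_start onwards enter the average, so the
        # early iterates dominated by the initial error are discarded.
        if t >= tail_start:
            tail_sum += theta
    return tail_sum / (n - tail_start)


# Toy deterministic two-state cycle 0 -> 1 -> 0 with one-hot features;
# reward 1 when leaving state 0, reward 0 when leaving state 1.
states = np.arange(2000) % 2
features = np.eye(2)[states]
next_features = np.eye(2)[(states + 1) % 2]
rewards = (states == 0).astype(float)

theta_bar = tail_averaged_td0(features, rewards, next_features,
                              gamma=0.5, step_size=0.5, tail_start=1000)
print(theta_bar)
```

On this toy chain the true values solve V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0), giving V(0) = 4/3 and V(1) = 2/3, which the tail average should approach.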


