Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL), in which a surrogate problem that restricts consecutive policies to be ‘close’ to one another is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish an Õ(1/√N) convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of Õ(1/N), much like results in convex optimization. This is the first result in RL showing faster rates when the instantaneous cost or reward is regularized.
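To make the update concrete, below is a minimal sketch, not the authors' implementation, of the closed-form step that a KL-based TRPO surrogate admits in the tabular setting: a multiplicative-weights (mirror descent) update of the policy against the current Q-values. The step-size schedule mentioned in the docstring is an assumption motivated by the stated rates.

```python
# Minimal sketch (assumptions: tabular policy with strictly positive entries,
# exact Q-values, reward maximization) of the mirror-descent step that the
# KL-regularized TRPO surrogate solves in closed form.
import numpy as np

def trpo_policy_step(pi, Q, eta):
    """One adaptive-TRPO-style policy update.

    pi  : (S, A) current policy, rows sum to 1, entries > 0
    Q   : (S, A) Q-values of the current policy
    eta : step size; schedules such as eta_k ∝ 1/√k (unregularized) or
          eta_k ∝ 1/k (regularized) are assumptions matching the rates above
    """
    logits = np.log(pi) + eta * Q                 # exponentiated-gradient step
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```

In the sample-based variant, Q would be replaced by an empirical estimate; in regularized MDPs, the regularized Q-function would take its place.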
The study of graphene-based antivirals is still at a nascent stage, and the photothermal antiviral properties of graphene have yet to be studied. Here, we design and synthesize sulfonated magnetic nanoparticles functionalized with reduced graphene oxide (SMRGO) to capture and photothermally destroy herpes simplex virus type 1 (HSV-1). Graphene sheets were uniformly anchored with spherical magnetic nanoparticles (MNPs) with sizes ranging from ∼5 to 25 nm. Fourier-transform infrared spectroscopy (FT-IR) confirmed the sulfonation and the anchoring of MNPs on the graphene sheets. Upon irradiation of the composite with near-infrared light (NIR, 808 nm, 7 min), SMRGO (100 ppm) demonstrated superior (∼99.99%) photothermal antiviral activity, likely owing to the capture efficiency, unique sheet-like structure, high surface area, and excellent photothermal properties of graphene. In addition, electrostatic interactions of MNPs with viral particles appear to play a vital role in the inhibition of viral infection. These results suggest that graphene composites may help to combat viral infections, including, but not limited to, HSV-1.
Studies of nanoscale superconducting structures have revealed various physical phenomena and led to the development of a wide range of applications. Most of these studies have concentrated on one- and two-dimensional structures, owing to the lack of approaches for creating fully engineered three-dimensional (3D) nanostructures. Here, we present a ‘bottom-up’ method to create 3D superconducting nanostructures with prescribed multiscale organization using DNA-based self-assembly methods. We assemble 3D DNA superlattices from octahedral DNA frames with incorporated nanoparticles by connecting frames at their vertices, which results in cubic superlattices with a 48 nm unit cell. The superconducting superlattice is formed by first converting the DNA superlattice into a highly structured 3D silica scaffold, turning it from a soft, liquid-environment-dependent macromolecular construction into a solid structure, and then coating it with superconducting niobium (Nb). Through low-temperature electrical characterization we demonstrate that this process creates 3D arrays of Josephson junctions. This approach may be utilized in the development of a variety of applications, such as 3D superconducting quantum interference devices (SQUIDs) for measurement of the magnetic field vector, highly sensitive superconducting quantum interference filters (SQIFs), and parametric amplifiers for quantum information systems.
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish Õ(√(S²AH⁴K)) regret for stochastic rewards. Furthermore, we prove Õ(√(S²AH⁴)·K^(2/3)) regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
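As an illustration of the optimistic evaluation step, here is a minimal sketch under stated assumptions (bonus scale c·H, visit counts clipped to at least one, Q-values clipped to [0, H]); it is not the paper's pseudocode. Empirical transitions and rewards are combined with a bonus that shrinks as √(1/n(s,a)), and the resulting optimistic Q-values would then drive a TRPO-style policy update such as the one sketched above.

```python
# Minimal sketch of optimistic backward induction in a tabular
# finite-horizon MDP. The bonus form c*H*sqrt(1/n(s,a)) and the clipping
# to [0, H] are illustrative assumptions, not the paper's exact constants.
import numpy as np

def optimistic_q(p_hat, r_hat, counts, pi, H, c=1.0):
    """Optimistic policy evaluation via backward induction.

    p_hat  : (S, A, S) empirical transition probabilities
    r_hat  : (S, A)    empirical mean rewards
    counts : (S, A)    visit counts, assumed >= 1
    pi     : (H, S, A) current time-dependent policy
    H      : horizon
    """
    S, A = r_hat.shape
    bonus = c * H * np.sqrt(1.0 / counts)     # optimism: shrinks with visits
    Q = np.zeros((H, S, A))
    V = np.zeros(S)                           # terminal value V_H = 0
    for h in reversed(range(H)):
        Q[h] = np.clip(r_hat + bonus + p_hat @ V, 0.0, H)
        V = (pi[h] * Q[h]).sum(axis=1)        # evaluate the current policy
    return Q
```

Each episode would then update the policy at every step h with a mirror-descent step on Q[h] and refresh counts, p_hat, and r_hat from the observed trajectory.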