In this paper, we study the learning of safe policies in reinforcement learning problems. That is, we aim to control a Markov Decision Process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. We therefore consider a constrained MDP where the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained in the case of episodic tasks, and they are kept up to a given time horizon for continuing tasks. (ii) The constrained optimization problem, despite its non-convexity, has an arbitrarily small duality gap if the parametrization of the policy is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be easily computed using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required safety levels.
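The primal-dual idea mentioned above can be illustrated with a minimal sketch. This is a deterministic toy problem, not the paper's algorithm: the reward `-(theta - 2)^2`, the constraint `1 - theta >= 0`, the step size, and the function name `primal_dual` are all assumptions chosen for illustration, and exact gradients stand in for the stochastic policy-gradient estimates the abstract refers to.

```python
def primal_dual(step=0.05, iters=2000):
    """Toy primal-dual iteration on a one-dimensional constrained problem.

    Hypothetical problem (not from the paper):
        maximize   -(theta - 2)^2
        subject to 1 - theta >= 0
    Lagrangian: L(theta, lam) = -(theta - 2)^2 + lam * (1 - theta)
    """
    theta, lam = 0.0, 0.0
    for _ in range(iters):
        # Gradient of the Lagrangian with respect to the primal variable.
        grad_theta = -2.0 * (theta - 2.0) - lam
        # Constraint slack, which is the gradient with respect to lam.
        slack = 1.0 - theta
        # Primal ascent step.
        theta = theta + step * grad_theta
        # Dual descent step, projected onto lam >= 0.
        lam = max(0.0, lam - step * slack)
    return theta, lam


theta_star, lam_star = primal_dual()
```

In this toy, the multiplier grows while the constraint is violated and pulls `theta` back to the boundary of the feasible set, which is the mechanism a primal-dual safe-learning method relies on.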
In this paper, we propose a novel sensor selection scheme for networks equipped with energy harvesting sensing devices. Ultimately, the goal is to minimize the reconstruction distortion at the fusion center by selecting a reduced (i.e., sparse) yet sufficiently informative subset of sensors. The solution must also fulfill the causality constraints associated with the energy harvesting process. For a classical formulation, the optimization problem turns out to be nonconvex. To circumvent that, we promote sparsity directly in the power allocation vector by introducing a log-sum penalty term in the cost function. The problem can be solved iteratively by resorting to a majorization-minimization procedure, leading to a stationary point. Numerical results reveal that, by using a log-sum penalty term, the sensor selection scheme outperforms others based on the ℓ1 norm while making effective use of the harvested energy.
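To make the log-sum penalty and the majorization-minimization step concrete, here is a minimal sketch on a toy denoising problem rather than the sensor selection problem itself. The objective `0.5 * ||x - y||^2 + lam * sum(log(|x_i| + eps))`, the parameter values, and the names `soft_threshold` and `log_sum_mm` are assumptions for illustration. Because the logarithm is concave, its tangent at the current iterate majorizes it, so each iteration reduces to a weighted ℓ1 problem with a closed-form soft-thresholding solution.

```python
import math


def soft_threshold(value, threshold):
    """Closed-form minimizer of 0.5 * (x - value)^2 + threshold * |x|."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def log_sum_mm(y, lam=0.1, eps=0.1, iters=50):
    """Majorization-minimization for a log-sum penalized toy problem.

    Hypothetical objective (not the paper's cost function):
        0.5 * sum((x_i - y_i)^2) + lam * sum(log(|x_i| + eps))
    """
    x = list(y)
    for _ in range(iters):
        # Majorize: linearize the concave log around the current iterate,
        # which yields per-entry weights 1 / (|x_i| + eps).
        weights = [1.0 / (abs(xi) + eps) for xi in x]
        # Minimize: the surrogate is a weighted l1 problem solved entrywise.
        x = [soft_threshold(yi, lam * wi) for yi, wi in zip(y, weights)]
    return x


x_sparse = log_sum_mm([3.0, 0.05, -2.0, 0.02])
```

Small entries receive large weights and are driven exactly to zero, while large entries are shrunk only slightly; a plain ℓ1 penalty would apply the same threshold to every entry.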
In this paper, we investigate the problem of jointly selecting a predefined number of energy-harvesting (EH) sensors and computing the optimal power allocation. The ultimate goal is to minimize the reconstruction distortion at the fusion center. This optimization problem is, unfortunately, non-convex. To circumvent that, we propose two suboptimal strategies: (i) a joint sensor selection and power allocation (JSS-EH) scheme that, we prove, is capable of iteratively finding a stationary solution of the original problem from a sequence of surrogate convex problems; and (ii) a separate sensor selection and power allocation (SS-EH) scheme, on which basis we can identify a sensible sensor selection and analytically find a power allocation policy by solving a convex problem. We also discuss the interplay between the two strategies. Performance in terms of reconstruction distortion, impact of initialization, actual subsets of selected sensors and computed power allocation policies, etc., is assessed by means of computer simulations. To that aim, an EH-agnostic sensor selection strategy, a lower bound on distortion, and an online version of the SS-EH and JSS-EH schemes are derived and used for benchmarking.
In this paper, we investigate the reconstruction of time-correlated sources in a point-to-point communications scenario comprising an energy-harvesting sensor and a Fusion Center (FC). Our goal is to minimize the average distortion in the reconstructed observations by using data from previously encoded sources as side information. First, we analyze a delay-constrained scenario, where the sources must be reconstructed before the next time slot. We formulate the problem in a convex optimization framework and derive the optimal transmission (i.e., power and rate allocation) policy. To solve this problem, we propose an iterative algorithm based on the subgradient method. Interestingly, the solution to the problem consists of a coupling between a two-dimensional directional water-filling algorithm (for power allocation) and a reverse water-filling algorithm (for rate allocation). Then we find a more general solution to this problem in a delay-tolerant scenario where the time horizon for source reconstruction is extended to multiple time slots. Finally, we provide some numerical results that illustrate the impact of delay and correlation on the power and rate allocation policies, and on the resulting reconstruction distortion. We also discuss the performance gap exhibited by a heuristic online policy derived from the optimal (offline) one.
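For readers unfamiliar with reverse water-filling, the following is a minimal sketch of the classical textbook version for independent Gaussian components under a total distortion budget. It does not include the coupling with directional water-filling or the side information described above; the function name `reverse_water_filling` and the bisection search are assumptions for illustration. Each component gets distortion `min(theta, variance)`, where the water level `theta` is chosen so the distortions sum to the budget, and a component whose variance lies below the water level receives zero rate.

```python
import math


def reverse_water_filling(variances, total_distortion, iters=100):
    """Classical reverse water-filling for independent Gaussian components.

    Returns the water level, the per-component distortions, and the
    per-component rates in bits per sample.
    """
    if min(variances) <= 0:
        raise ValueError("variances must be positive")
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")
    # Bisection on the water level: the total distortion is nondecreasing in it.
    low, high = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (low + high)
        if sum(min(theta, v) for v in variances) > total_distortion:
            high = theta
        else:
            low = theta
    theta = 0.5 * (low + high)
    distortions = [min(theta, v) for v in variances]
    # Components with variance below the water level get zero rate.
    rates = [0.5 * math.log2(v / d) for v, d in zip(variances, distortions)]
    return theta, distortions, rates


theta, distortions, rates = reverse_water_filling([4.0, 1.0, 0.25], 1.5)
```

With variances 4, 1, and 0.25 and a budget of 1.5, the water level settles at 0.625, so the weakest component is not encoded at all and the budget is split evenly between the other two.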