Reward-free exploration is a reinforcement learning setting recently studied by Jin et al. [17], who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more adaptive approach to reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm proposed by Fiechter in 1994 [11] for a different objective that we call best-policy identification. We prove that RF-UCRL needs O((SAH^4/ε^2) log(1/δ)) episodes to output, with probability 1 − δ, an ε-approximation of the optimal policy for any reward function. We empirically compare it to oracle strategies that use a generative model.
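As a rough illustration of the exploration scheme described in this abstract, the sketch below shows a reward-free, UCRL-style loop: the agent maintains an upper bound on the MDP estimation error, acts greedily with respect to that bound, and stops once the bound is small. The bonus shape, the stopping threshold, and the `env` interface are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rf_ucrl_sketch(env, S, A, H, epsilon, delta, max_episodes=100_000):
    counts = np.ones((S, A))        # visit counts n(s, a); start at 1 to avoid division by zero
    p_hat = np.ones((S, A, S)) / S  # empirical transition model, initialized uniform
    for episode in range(max_episodes):
        # Backward induction on an upper bound E[h, s, a] of the estimation error.
        E = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            bonus = H * np.sqrt(np.log(S * A * H / delta) / counts)  # illustrative bonus shape
            next_val = E[h + 1].max(axis=1)                          # greedy over next actions
            E[h] = np.minimum(H, bonus + p_hat @ next_val)
        if E[0].max() <= epsilon / 2:  # stop once the error bound is small everywhere
            return episode, p_hat
        # Collect one episode, acting greedily with respect to the error bound.
        s = env.reset()                # hypothetical tabular env interface
        for h in range(H):
            a = int(np.argmax(E[h, s]))
            s_next = env.step(a)       # rewards are ignored: this is reward-free exploration
            counts[s, a] += 1
            p_hat[s, a] += (np.eye(S)[s_next] - p_hat[s, a]) / counts[s, a]
            s = s_next
    return max_episodes, p_hat
```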
We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular, possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers from while, at the same time, limiting its impact on the second-order term of the regret. For UCBMQ, we are able to guarantee a regret of at most O(√(H^3 SAT) + H^4 SA), where H is the length of an episode, S the number of states, A the number of actions, and T the number of episodes, ignoring terms polylogarithmic in SAHT. Notably, UCBMQ is the first algorithm that simultaneously matches the lower bound of Ω(√(H^3 SAT)) for large enough T and has a second-order term (with respect to T) that scales only linearly with the number of states S. (For the same reason, the first-order term of the bound for the UCRL algorithm of Jaksch et al. (2010) carries an extra factor of S; the improved analysis of Azar et al. (2017) "pushes" this factor to the second-order term.)
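A minimal sketch of the kind of update this abstract describes follows: an optimistic Q-learning step augmented with a momentum term that accumulates past temporal-difference errors. The step size, bonus constant, and the exact way the momentum enters the update are assumptions for illustration, not UCBMQ's precise rule.

```python
import numpy as np

def ucbmq_style_update(Q, V, M, N, h, s, a, r, s_next, H, c=1.0, delta=0.05):
    """One optimistic Q-learning update with a momentum correction (sketch).

    Q: (H+1, S, A) optimistic Q-values; V: (H+1, S) optimistic values;
    M: (H, S, A) momentum (running average of TD errors); N: (H, S, A) visit counts.
    """
    N[h, s, a] += 1
    n = N[h, s, a]
    lr = (H + 1) / (H + n)                     # standard step size for optimistic Q-learning
    target = r + V[h + 1, s_next]              # one-step bootstrapped target
    td_error = target - Q[h, s, a]
    M[h, s, a] = (1 - lr) * M[h, s, a] + lr * td_error  # momentum: smoothed TD errors
    bonus = c * H * np.sqrt(np.log(1 / delta) / n)      # illustrative UCB exploration bonus
    Q[h, s, a] = min(H, (1 - lr) * Q[h, s, a] + lr * (target + M[h, s, a] + bonus))
    V[h, s] = min(H, Q[h, s].max())            # optimistic value used by the greedy policy
```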
In this work, we propose KeRNS (Kernel-based Reinforcement Learning in Non-Stationary environments): an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and with the total variation of the MDP over time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting for handling changing environments. We further propose a practical implementation of KeRNS, analyze its regret, and validate it experimentally.
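The sketch below illustrates the time-dependent kernel weighting this abstract refers to: each past transition is weighted by its distance to the query state-action pair and by its age, so that both sliding windows and exponential discounting arise as special cases of the temporal factor. The kernel shapes, bandwidth, and function names are assumptions for illustration.

```python
import numpy as np

def temporal_weight(age, mode="exponential", window=500, discount=0.99):
    if mode == "sliding_window":     # hard cutoff: the classic sliding-window scheme
        return 1.0 if age < window else 0.0
    return discount ** age           # soft cutoff: exponential discounting

def kernel_weights(query_sa, history, t_now, bandwidth=0.5):
    # history: list of (sa_vector, t_i) pairs; the metric on state-action
    # pairs is assumed Euclidean here for simplicity.
    w = np.empty(len(history))
    for i, (sa_i, t_i) in enumerate(history):
        space = np.exp(-np.linalg.norm(query_sa - sa_i) ** 2 / bandwidth ** 2)
        w[i] = space * temporal_weight(t_now - t_i)
    return w / max(w.sum(), 1e-12)   # normalized weights for the non-parametric MDP model
```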
Multifractal analysis allows us to study scale invariance and fluctuations of the pointwise regularity of time series. A theoretically well-grounded multifractal formalism, based on wavelet leaders, was applied to electroencephalography (EEG) time series measured in healthy volunteers and epilepsy patients, provided by the University of Bonn. We show that the multifractal spectrum during a seizure indicates a lower global regularity compared to non-seizure data, and that multifractal features, combined with a few baseline features, can be used to train a supervised learning algorithm to discriminate, well above chance, ictal (i.e., seizure) epochs from healthy and interictal epochs (97%) and healthy controls from patients (92%).
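To make the feature-extraction step concrete, here is a simplified sketch of wavelet-based multifractal features for an EEG epoch. It uses coefficient-based structure functions rather than the full wavelet-leader formalism of the paper (leaders are also needed to handle negative moments q); the wavelet choice and q-grid are assumptions. The resulting scaling exponents would then be fed, together with baseline features, to a standard supervised classifier.

```python
import numpy as np
import pywt

def multifractal_features(signal, wavelet="db3", levels=8, qs=(1, 2, 3, 4, 5)):
    # Detail coefficients per scale, reordered from finest (j = 1) to coarsest (j = levels).
    details = pywt.wavedec(signal, wavelet, level=levels)[1:][::-1]
    zetas = []
    for q in qs:
        # Structure function S(j, q) = mean |d_{j,k}|^q at each scale j.
        log_S = [np.log2(np.mean(np.abs(d) ** q) + 1e-12) for d in details]
        scales = np.arange(1, len(log_S) + 1)
        zeta_q = np.polyfit(scales, log_S, 1)[0]   # slope across scales = scaling exponent
        zetas.append(zeta_q)
    # Departure of zeta(q) from linearity in q is the signature of multifractality.
    return np.array(zetas)
```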