This paper establishes problem-specific sample complexity lower bounds for linear system identification problems. The sample complexity is defined in the PAC framework: it corresponds to the time it takes to identify the system parameters with prescribed accuracy and confidence levels. By problem-specific, we mean that the lower bound explicitly depends on the system to be identified (in contrast with minimax lower bounds), and hence truly captures the identification hardness specific to the system. We consider both uncontrolled and controlled systems. For uncontrolled systems, the lower bounds are valid for any linear system, stable or not, and depend only on the system's finite-time controllability Gramian. A simplified lower bound depending only on the spectrum of the system is also derived. In view of recent finite-time analyses of classical estimation methods (e.g., ordinary least squares), our sample complexity lower bounds are tight for many systems. For controlled systems, our lower bounds are not as explicit as in the uncontrolled case, but could well provide interesting insights into the design of control policies with minimal sample complexity.
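The finite-time controllability Gramian mentioned above has a direct computational form: for a system x_{t+1} = A x_t + w_t driven by isotropic noise, W_t = Σ_{k=0}^{t-1} A^k (A^k)^T. The following sketch (an illustration, not code from the paper) computes it for an arbitrary matrix A; the example system is a hypothetical stable 2-dimensional one.

```python
import numpy as np

def controllability_gramian(A: np.ndarray, t: int) -> np.ndarray:
    """Finite-time controllability Gramian W_t = sum_{k=0}^{t-1} A^k (A^k)^T."""
    d = A.shape[0]
    W = np.zeros((d, d))
    Ak = np.eye(d)  # running power A^k, starting at A^0 = I
    for _ in range(t):
        W += Ak @ Ak.T
        Ak = Ak @ A
    return W

# Example: a stable 2-d system (hypothetical, for illustration only)
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])
W5 = controllability_gramian(A, 5)
```

Note that W_1 is always the identity, and the Gramian grows with t; for stable systems it converges to the infinite-horizon Gramian, whereas for unstable systems it blows up, which is exactly why the lower bound can remain informative in both regimes.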
We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair (i, j) ∈ [m] × [n] is selected, the learner observes a noisy sample of the (i, j)-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an ε-optimal policy with probability at least 1 − δ typically scales as ((m + n)/ε²) log(1/δ). Our regret minimization algorithm enjoys minimax guarantees scaling as r^{7/4}(m + n)^{3/4}√T, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly r(m + n) and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.
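The first phase described above can be illustrated with a minimal spectral sketch (an assumption-laden toy, not the paper's algorithm): average noisy entry observations into an estimate of the reward matrix, then take its rank-r SVD to recover the left and right singular subspaces. Here the reward matrix, noise level, and dimensions are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-1 reward matrix (m x n), for illustration only
m, n, r = 6, 5, 1
u = rng.normal(size=(m, 1))
v = rng.normal(size=(n, 1))
M = u @ v.T

# Phase 1: one noisy look at each entry, then a rank-r truncated SVD
samples = M + 0.01 * rng.normal(size=(m, n))
U, s, Vt = np.linalg.svd(samples, full_matrices=False)
U_hat, V_hat = U[:, :r], Vt[:r, :].T  # estimated singular subspaces

# Subspace recovery quality: distance between orthogonal projectors
u_unit = u / np.linalg.norm(u)
proj_err = np.linalg.norm(U_hat @ U_hat.T - u_unit @ u_unit.T)
```

In the paper the guarantees are stated in the two-to-infinity norm (row-wise error), which is stronger than the projector distance computed here; the sketch only conveys the overall pipeline of subspace estimation followed by a reduced, roughly r(m + n)-dimensional problem.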
We present a new finite-time analysis of the estimation error of the Ordinary Least Squares (OLS) estimator for stable linear time-invariant systems. We characterize the number of observed samples (the length of the observed trajectory) sufficient for the OLS estimator to be (ε, δ)-PAC, i.e., to yield an estimation error less than ε with probability at least 1 − δ. We show that this number matches existing sample complexity lower bounds [1, 2] up to universal multiplicative factors (independent of (ε, δ) and of the system). This paper hence establishes the optimality of the OLS estimator for stable systems, a result conjectured in [1]. Our analysis of the performance of the OLS estimator is simpler, sharper, and easier to interpret than existing analyses. It relies on new concentration results for the covariates matrix.
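For concreteness, the OLS estimator analyzed above admits a closed form: given a trajectory x_0, …, x_T of x_{t+1} = A x_t + w_t, it is Â = (Σ_t x_{t+1} x_t^T)(Σ_t x_t x_t^T)^{-1}. A minimal simulation sketch (the system, noise level, and horizon are hypothetical choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a stable linear system x_{t+1} = A x_t + w_t
A = np.array([[0.8, 0.1],
              [0.0, 0.7]])
T = 2000
X = np.zeros((T + 1, 2))
for t in range(T):
    X[t + 1] = A @ X[t] + rng.normal(scale=0.1, size=2)

# OLS estimator: A_hat = (sum x_{t+1} x_t^T) (sum x_t x_t^T)^{-1}
Y, Z = X[1:], X[:-1]          # rows are x_{t+1} and x_t respectively
A_hat = (Y.T @ Z) @ np.linalg.inv(Z.T @ Z)
err = np.linalg.norm(A_hat - A, ord=2)
```

The finite-time analysis in the paper quantifies how large T must be for err ≤ ε with probability 1 − δ; the key technical ingredient is concentration of the covariates matrix Z^T Z.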
We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices A and B may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. This regret at time T is upper bounded (i) by O((d_u + d_x)√(d_x T)) when A and B are unknown, (ii) by O(d_x² log(T)) if only A is unknown, and (iii) by O(d_x(d_u + d_x) log(T)) if only B is unknown and under some mild non-degeneracy condition (d_x and d_u denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in T, d_x and d_u as they match existing lower bounds in scenario (i) when d_x ≤ d_u [SF20], and in scenario (ii) [Lai86]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting).

Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation error on A and B and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, where the estimates of A and B and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the performance of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
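The certainty-equivalence idea mentioned above can be sketched as follows: plug the current estimates (Â, B̂) into the discrete-time Riccati equation as if they were the true (A, B), and play the resulting LQR gain. The sketch below (an illustration under made-up system matrices, not the paper's algorithm) computes the gain by Riccati value iteration and checks that the closed loop is stable.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Riccati value iteration; returns the LQR feedback gain K (u = -K x)."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Certainty equivalence: in the online setting, (A, B) below would be the
# current least-squares estimates, refreshed as often as desired.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])   # hypothetical (unstable) double integrator
B = np.array([[0.0],
              [0.1]])
Q, R = np.eye(2), np.eye(1)
K = lqr_gain(A, B, Q, R)

# The closed loop A - B K should have spectral radius < 1
rho = max(abs(np.linalg.eigvals(A - B @ K)))
```

The analytical difficulty the paper addresses is precisely that when (Â, B̂), and hence K, change at every step, the state trajectory is no longer generated by a fixed closed-loop system, which invalidates the usual epoch-based estimation arguments.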