2019 IEEE 58th Conference on Decision and Control (CDC)
DOI: 10.1109/cdc40024.2019.9029461
Decision Variance in Risk-Averse Online Learning

Abstract: Online learning has traditionally focused on expected rewards. In this paper, a risk-averse online learning problem under the performance measure of the mean-variance of the rewards is studied. Both the bandit and full-information settings are considered. The performance of several existing policies is analyzed, and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although a logarithmic distribution-dependent regret in time T is achievable (similar to the …
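For orientation, the following is a minimal sketch of the mean-variance criterion standard in this line of work (following Sani, Lazaric, and Munos 2012); the notation below (risk-tolerance parameter \rho, per-arm mean \mu_i and variance \sigma_i^2) is an assumption on my part and may differ from the exact formulation in the paper, whose abstract is truncated above.

% Mean-variance of arm i, with mean \mu_i, variance \sigma_i^2,
% and risk-tolerance parameter \rho > 0 (smaller is better):
\[
  \mathrm{MV}_i = \sigma_i^2 - \rho\,\mu_i .
\]
% A mean-variance regret over horizon T then compares the empirical
% mean-variance of the rewards collected by a policy \pi with that of
% the single best arm:
\[
  \mathcal{R}_T(\pi) = \widehat{\mathrm{MV}}_T(\pi) - \min_{i} \mathrm{MV}_i .
\]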

Cited by 5 publications (3 citation statements) | References 17 publications
“…In that context, a common practice is to formulate the problem as one of the renowned multi-armed bandit variants [40] and then conduct regret analysis, showing that the expected total regret (defined as the gap between the total utility achieved by a given policy and that of a prophet optimal) is upper bounded by a certain function of the total time horizon [41]-[44]. A few recent works investigate the potential tradeoff between variance and regret in online learning; see, e.g., [45,46]. In particular, Vakili et al. [46] introduced and analyzed the performance of several risk-averse policies in both the bandit and full-information settings under the metric of mean-variance [47].…”
Section: Main Techniques and Other Related Work
Citation type: mentioning; confidence: 99%
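For comparison with the excerpt above, here is a minimal sketch of the standard expected-regret notion it refers to; the symbols \mu^*, a_t, and \mu_{a_t} are generic multi-armed bandit notation, not taken verbatim from the cited works.

% Expected cumulative regret of a policy over horizon T, where
% \mu^* is the best arm's mean reward and a_t is the arm pulled at time t:
\[
  R_T = T\,\mu^* - \mathbb{E}\Big[ \sum_{t=1}^{T} \mu_{a_t} \Big],
\]
% which the cited works bound by a function of T (e.g., logarithmic in T
% in the distribution-dependent regime).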
“…A few recent works investigate the potential tradeoff between variance and regret in online learning; see, e.g., [45,46]. In particular, Vakili et al. [46] introduced and analyzed the performance of several risk-averse policies in both the bandit and full-information settings under the metric of mean-variance [47].…”
Section: Main Techniques and Other Related Work
Citation type: mentioning; confidence: 99%
“…Mean-variance Bandit Literature. (Sani, Lazaric, and Munos 2012) open the mean-variance bandit literature, which incorporates both the expected reward and its variability into the performance measure, and a series of follow-ups (Maillard 2013; Vakili, Boukouvalas, and Zhao 2019; Cardoso and Xu 2019) have emerged recently. To the best of our knowledge, this paper is the first to study the mean-variance bandit problem with conservative exploration.…”
Section: Related Work
Citation type: mentioning; confidence: 99%