Decision Variance in Risk-Averse Online Learning
Online learning has traditionally focused on expected rewards. In this paper, a risk-averse online learning problem is studied in which performance is measured by the mean-variance of the rewards. Both the bandit and full-information settings are considered. The performance of several existing policies is analyzed, and new fundamental limitations on risk-averse learning are established. In particular, it is shown that although a distribution-dependent regret that is logarithmic in the time horizon T is achievable (as in the risk-neutral problem), the worst-case (i.e., minimax) regret is lower bounded by Ω(T) (in contrast to the Ω(√T) lower bound in the risk-neutral problem). This sharp difference from the risk-neutral counterpart is caused by the variance in the player's decisions, which is absent from the regret under the expected-reward criterion but contributes to excess mean-variance because this risk measure is nonlinear. The role of decision variance in regret performance reflects a risk-averse player's desire for robust decisions and outcomes.
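
The effect of decision variance can be illustrated with a small deterministic sketch. It assumes the common definition of mean-variance as variance minus ρ times mean (lower is better); the abstract does not state the definition, so this form, the value of `rho`, the two constant-reward arms, and the helper name `mean_variance` are all illustrative assumptions rather than the paper's setup. Each arm alone has zero variance, yet a player who switches between them incurs variance purely from the switching.

```python
from statistics import fmean, pvariance


def mean_variance(rewards, rho):
    """Empirical mean-variance of a reward sequence (assumed form):
    population variance minus rho times the sample mean. Lower is better."""
    return pvariance(rewards) - rho * fmean(rewards)


rho = 1.0  # hypothetical risk-tolerance weight
T = 100    # hypothetical horizon

# Two hypothetical arms with constant rewards, so each has zero variance.
arm_a = [0.0] * T
arm_b = [1.0] * T

# A player who splits the horizon evenly between the two arms.
mixed = [0.0] * (T // 2) + [1.0] * (T // 2)

mv_a = mean_variance(arm_a, rho)
mv_b = mean_variance(arm_b, rho)
mv_mixed = mean_variance(mixed, rho)

# If mean-variance were linear in the decisions, the mixed sequence would
# score the play-weighted average of the per-arm values.
average_of_arms = 0.5 * mv_a + 0.5 * mv_b

# The gap comes from the variance of the means of the arms actually played.
excess_from_decisions = mv_mixed - average_of_arms

# Under the expected-reward criterion there is no such gap.
mean_gap = fmean(mixed) - (0.5 * fmean(arm_a) + 0.5 * fmean(arm_b))

print("mean-variance of arm A:", mv_a)
print("mean-variance of arm B:", mv_b)
print("mean-variance of mixed play:", mv_mixed)
print("average of per-arm values:", average_of_arms)
print("excess due to decisions:", excess_from_decisions)
print("gap under expected reward:", mean_gap)
```

In this toy case the mixed sequence has a sample variance of 0.25 even though neither arm is noisy, which is the law of total variance at work: the variance of the played sequence is the average within-arm variance plus the variance of the arm means across the decisions. The expected reward, by contrast, is linear in the decisions, so no comparable term appears.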