# Second Order Regret Bounds Parametrized By Variance Across Actions And Top Epsilon Percentile

Second order regret bounds parametrized by variance across actions and top ${\epsilon}$ percentile

This blog will describe an open problem posted by Yoav Freund[1]. We’ll introduce the problem setting and an overview of regret bounds in section 1. In section 2, we’ll define variance across actions and how it comes into play. Then we’ll introduce ${\epsilon}$ percentile in section 3. Section 4 will put forward the open problem.

1. Introduction

This study focuses on extrenal regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action.

1.1. Problem setting

The forecasting game concerned here is played in rounds. At each time step ${t=1, 2, \cdots}$, the forecaster computes an assignment

$\displaystyle {\bf p}_t=(p_{1,t},\cdots,p_{N,t})$

which are probabilities over the ${N}$ actions. Then the payoff vector for time ${t}$ is revealed:

$\displaystyle {\bf x_t} = (x_{1,t},\cdots,x_{N,t})\in{\mathbb R}^N.$

The forecaster’s reward is

$\displaystyle \hat{x}_t=x_{1,t}p_{1,t} + \cdots + x_{N,t}p_{N,t}.$

We define the cumulative reward of the forecaster by

$\displaystyle \hat{X}_n=\hat{x}_1 + \cdots + \hat{x}_n$

and the cumulative payoff of action ${i}$ by

$\displaystyle X_{i,n} = x_{i,1} + \cdots + x_{i,n}.$

For all ${n}$, let ${X_n^*=max_{i=1,\cdots,N}X_{i,n}}$ be the cumulative payoff of the best action up to time ${n}$. The goal is to keep the ${regret=X_n^*-\hat{X}_n}$ as small as possible uniformly over ${n}$.

1.2. An overview of regret bounds

There are several kinds of regret bounds with repect to the forecasting games as follows:

Zero-order bounds: we say that a bound is of order zero whenever it only depends on bounds on the payoffs (or on the payoff ranges) and on the number of time step ${n}$. The basic version of the exponentially weighted average forecaster of Littlestone and Warmuth[2] ensures the order of magnitude of the regret is

$\displaystyle M\sqrt{n\ln{N}},$

where ${M}$ is the bound on the payoffs:

$\displaystyle |x_{i,t}|\leq M, \forall t\geq 1, i =1,\cdots,N.$

Actually, the factor ${M}$ can be replaced by a bound ${E}$ on the effective ranges of the payoffs, defined by

$\displaystyle |x_{i,t}-x_{j,t}|\leq E, \forall t\geq 1, i,j=1,\cdots,N.$

To reach this basic version of the regret bound, it assumes that we have prior knowledge of both ${n}$ and ${M}$ (or ${E}$), so as to tune the learning rate.

First-order bounds: a bound is first order whenever its main term depends on a sum of payoffs. We can define

$\displaystyle A_n^*=|x_{k_n^*,1}|+\cdots+|x_{k_n^*,n}|$

which is the sum of the obsolute values of the payoffs of the best expert ${k_n^*}$ for rounds ${1,\cdots,n}$. Note that the payoffs could be positive or negative. Allenberg-Neeman[3] gave a simple algorithm whose regret is of the order of

$\displaystyle \sqrt{MA_n^*\ln{N}}+M\ln{N}.$

Since the ${A_n^*\leq Mn}$, first-order bounds are usually sharper than zero-order bounds.

Second-order bounds: a regret bound is second-order whenever its main term is a function of a sum of squared payoffs (or on a quantity that is homogeneous in such a sum). Ideally, they are a function of

$\displaystyle Q_n^*=\sum_{t=1}^nx_{k_n^*,t}^2.$

Cesa-Bianchi[4] gave a simple algorithm whose regret is of the order of

$\displaystyle \sqrt{Q_n^*\ln{N}} + M\ln{N}.$

Since ${Q_n^*\leq MA_n^*}$, this bound improves the first-order bounds.

2. Variance across actions

If the payoffs are uniformly bounded, and there are infinitely many actions, then there exists simple forecasting strategies whose external regret per time step vanishes irrespective to the choice of the payoff sequence. However, if the payoffs were generated by an independent stochastic process, the tightest rate for the regret with respect to a fixed action should depend on variance (rather than the average) of the observed payoffs for that action. Cesa-Bianchi[4] provided an algorithm that makes a step towards it.

Definitions: Let ${Z_t}$ be the random variable with range ${\{x_{1,t},\cdots,x_{N,t}\}}$ and distribution ${{\bf p}_t}$. Introduce

$\displaystyle Var Z_t = \mathop{\mathbb E} Z_t^2 - \mathop{\mathbb E}^2Z_t = \sum_{i=1}^Np_{i,t}x_{i,t}^2 - (\sum_{i=1}^Np_{i,t}x_{i,t})^2.$

Hense ${Var Z_t}$ is the variance of the payoffs at time ${t}$ under the distribution ${\mathbf{p}_t}$ and the cumulative variance is defined as

$\displaystyle V_n=Var Z_1 + \cdots + Var Z_n$

which is the main second-order quantity used later.

The regret bounds parameterized by variance is derived for the weighted majority forecaster of Littlestone and Warmuth[1] using a time-varing learning rate. Because of the way the learning rate is chosen, this bound actually involves the way the algorithm predicts.

The weighted majority forecaster works as follows: it uses the sequence ${\eta_2, \eta_3, \cdots > 0}$ of learning rates assigns at time ${t}$ a probability distribution ${{\bf p}_t}$ over the ${N}$ actions defined by ${p_1=(1/N,\cdots,1/N)}$ and

$\displaystyle p_{i,t} = \frac{e^{\eta_{t}X_{i,t-1}}}{\sum_{j=1}^N e^{\eta_{t}X_{j,t-1}}}, t\geq2, i=1,\cdots,N.$

Note that the quantities ${\eta_t}$ may depend on the past payoffs.

Let’s first introduce a potential function

$\displaystyle \begin{array}{rcl} \Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t) &=& -\sum_{i=1}^Np_{i,t}x_{i,t} + \frac{1}{\eta_t}\ln{\sum_{i=1}^Np_{i,t}e^{\eta_{t}x_{i,t}}}\\ &=& \frac{1}{\eta_t}\ln{(\sum_{i=1}^Np_{i,t}e^{\eta_t(x_{i,t}-\hat{x}_t)})}. \end{array}$

The following lemma bounds the potential function in terms of variance.

Lemma 1 For all payoff vectors ${\mathbf{x}_t = (x_{1,t},\cdots,x_{N,t})}$, all probability distributions ${\mathbf{p}_t=(p_{1,t},\cdots,p_{N,t})}$, and all learning rates ${\eta_t\geq0}$, we have

$\displaystyle \Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t) \leq E \ \ \ \ \ (1)$

where ${E}$ is such that ${|x_{i,t}-x_{j,t}|\leq E, \forall i,j=1,\cdots,N}$. If, in addition, ${0\leq\eta_t|x_{i,t}-x_{j,t}|\leq1, \forall i,j=1,\cdots,N}$, then

$\displaystyle \Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t) \leq (e-2)\eta_t Var Z_t. \ \ \ \ \ (2)$

Proof: To prove the first inequality, we have

$\displaystyle \begin{array}{rcl} \Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t) &=& \frac{1}{\eta_t}\ln{(\sum_{i=1}^Np_{i,t}e^{\eta_t(x_{i,t}-\hat{x}_t)})}\\ &\leq& \frac{1}{\eta_t}\ln{(\sum_{i=1}^Np_{i,t}e^{\eta_t E})}\\ &=& \frac{1}{\eta_t}\ln{(e^{\eta_t E}\sum_{i=1}^Np_{i,t})}\\ &=& E. \end{array}$

To prove the second one, we use ${\exp{a} \leq 1+a+(e-2)a^2}$ for ${|a|\leq1}$.

$\displaystyle \begin{array}{rcl} \Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t) &\leq& \frac{1}{\eta_t}\ln{(\sum_{i=1}^Np_{i,t}(1+\eta_t(x_{i,t}-\hat{x}_t)+(e-2)\eta_t^2(x_{i,t}-\hat{x}_t)^2))}\\ &=& \frac{1}{\eta_t}\ln{(1 + (e-2)\eta_t^2Var Z_t)}\\ &\leq& (e-2)\eta_t Var Z_t. \end{array}$

The last step uses inequality ${\ln{(1+a)}\leq a, \forall a>-1}$. $\Box$

The following lemma is used to prove the main theorem later. The proof of this lemma is omitted but can be found in [4,5].

Lemma 2 Consider any nonincreasing sequence ${\eta_2, \eta_3,\cdots}$ of positive learning rates and any sequence ${\mathbf{x}_1, \mathbf{x}_2,\cdots\in {\mathbb R}^N}$ of payoff vectors. Then the weighted majority forecaster run with the sequence ${\eta_2, \eta_3,\cdots}$ satisfies, for any ${n\geq1}$ and for any ${\eta_1\geq\eta_2}$,

$\displaystyle \hat{X}_n-X_n^* \geq -(\frac{2}{\eta_{n+1}}-\frac{1}{\eta_1})\ln{N}-\sum_{t=1}^N\Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t).$

We now introduce time-varing learning rate based on ${V_n}$. The sequence ${\eta_2,\eta_3,\cdots}$ is defined as

$\displaystyle \eta_t=\min\{\frac{1}{E}, C\sqrt{\frac{\ln{N}}{V_{t-1}}}\}$

for ${t\geq2}$, with ${C=\sqrt{2(\sqrt{2}-1)/(e-2)}}$. Note that ${\eta_t}$ depends on the forecaster’s past predictions.

Theorem 3 Provide a bound ${E}$ on the payoff ranges is known beforehand, i.e., ${max_{t=1,\cdots,n}max_{i,j=1,\cdots,N}|x_{i,t}-x_{j,t}|\leq E}$, the weighted majority forecaster using the time-varing learning rate achieves, for all sequences of payoffs and for all ${n\geq1}$,

$\displaystyle \hat{X}_n-X_n^*\geq-4\sqrt{V_n\ln{N}}-2E\ln{N}-E/2.$

Proof: By using Lemma 1 and 2, and setting ${\eta_1=\eta_2}$ for the analysis, we have

$\displaystyle \begin{array}{rcl} \hat{X}_n-X_n^* &\geq& -(\frac{2}{\eta_{n+1}}-\frac{1}{\eta_1})\ln{N}-\sum_{t=1}^N\Phi(\mathbf{p}_t, \eta_t, \mathbf{x}_t)\\ &\geq& -2\max\{E\ln{N},\frac{1}{C}\sqrt{V_n\ln{N}}\}-(e-2)\sum_{t=1}^n\eta_tVar Z_t. \end{array}$

We now try to bound the second term. Suppose ${T}$ the round that the first time step t when ${V_t>\frac{E^2}{4}}$. Note that

$\displaystyle \forall t>0, VarZ_t \leq E^2/4$

so

$\displaystyle V_T=V_{T-1}+VarZ_T\leq\frac{E^2}{4}+\frac{E^2}{4}=\frac{E^2}{2}.$

Also we have ${V_t\leq V_{t-1} + \frac{E^2}{4}}$ and ${V_{t-1}\geq\frac{E^2}{4}}$ for ${t>T}$. It’ll give us ${\frac{V_t}{V_{t-1}}\leq 2}$ for ${t>T}$. Since ${\eta_t\leq \frac{1}{E}}$ and ${\eta_t\leq C\sqrt{\frac{\ln{N}}{V_{t-1}}}}$, we have

$\displaystyle \begin{array}{rcl} \sum_{t=1}^n\eta_tVar Z_t &\leq& \frac{1}{E}V_T + C\sqrt{\ln{N}}\sum_{t=T+1}^n\frac{V_t - V_{t-1}}{\sqrt{V_{t-1}}}\\ &\leq& \frac{E}{2} + C\sqrt{\ln{N}}\sum_{t=T+1}^n\frac{\sqrt{V_t}-\sqrt{V_{t-1}}}{\sqrt{V_{t-1}}}(\sqrt{V_t}-\sqrt{V_{t-1}})\\ &\leq& \frac{E}{2} + C\sqrt{\ln{N}}\sum_{t=T+1}^n(\sqrt{2}+1)(\sqrt{V_t}-\sqrt{V_{t-1}})\\ &=&\frac{E}{2} + \frac{C}{\sqrt{2}-1}\sqrt{V_n\ln{N}}. \end{array}$

So we get

$\displaystyle \hat{X}_n-X_n^* \geq -2\max\{E\ln{N},\frac{1}{C}\sqrt{V_n\ln{N}}\} - \frac{e-2}{2}E - \frac{C(e-2)}{\sqrt{2}-1}\sqrt{V_n\ln{N}}.$

It’s easy to see that when ${E\ln{N} \leq \frac{1}{C}\sqrt{V_n\ln{N}}}$, which means ${\sqrt{V_n} \geq CE\sqrt{\ln{N}}}$, the regret is bouned by

$\displaystyle -4\sqrt{V_n\ln{N}} - \frac{E}{2}.$

When ${\sqrt{V_n} < CE\sqrt{\ln{N}}}$, the regret is bounded by

$\displaystyle -2E\ln{N} - 2\sqrt{V_n\ln{N}} - \frac{E}{2}.$

Putting things together, it concludes the proof. $\Box$

Theorem 3 tells us that the leading term of regret can be bounded as the order of ${\sqrt{V_n\ln{N}}}$. ${V_n}$ is the second-order quantity associated to the sequence of payoffs. However, the precise defition of these quantities makes the bounds generally not comparable to the bounds shown in section 1. One would prefer the bound in this section in case of randomized prediction, in which the forecaster draws an action ${I_t}$ at random according to ${\mathbf{p}_t}$ at every round (${regret=x_{I_1,1}+\cdots+x_{I_n,n}-X_n*}$), since the bounds in section 1 are more explict in terms of the payoff sequence as they do not involve the way the algorithm predicts.

3. Top ${\epsilon}$ percentile

Top ${\epsilon}$-quantile was first introduced by Chaudhuri, Freund and Hsu[6] and has been studied by several groups[7,8,9]. It was used to solve the issue with using online-learning in problems when the number of actions(${N}$) is very large.

In this section, we are considering the cumulative losses instead of payoffs. When the number of actions is very large, the regret to the best action is not an effective measure of performance. In some situations, one expects to have a lot of actions that are close to the action with the lowest loss or highest payoff. As these actions also have low losses, measuring performance with respect to a small group of actions that perform well is extremly reasonable.

Here is a notion of ${\epsilon}$-quantile, which is more natural for practical applications. We order the cumulative losses of all actions from lowest to highest and define the regret of the forecaster to the top ${\epsilon}$-quantile to be the difference between cumulative loss of the forecaster and the ${\lfloor \epsilon N\rfloor}$-th element in the sorted list Chaudhuri, Freund and Hsu[6] provided an algorithm, so called Normal-Hedge, which gives the regret to the top ${\epsilon}$-quantile of actions at most

$\displaystyle O(\sqrt{n\ln{\frac{1}{\epsilon}}} + \ln^2N).$

It holds for all ${T}$ and ${\epsilon}$. If we set ${\epsilon = \frac{1}{N}}$, we get that the regret to the best action is upper-bounded by ${O(\sqrt{n\ln{N}} + \ln^2N)}$.

3.1. Setting

We assume that the losses lie in an interval of length 1. In round ${t}$, each action ${i}$ incurs a loss ${l_{i,t}}$, and the forecaster incurs the expected loss:

$\displaystyle l_{A,t}=\sum_{i=1}^Np_{i,t}l_{i,t}.$

The forecaster’s instantaneous regret to an action ${i}$ in round ${t}$ is ${r_{i,t}=l_{A,t}-l_{i,t}}$, and the cumulative regret to an action ${i}$ in the first ${t}$ rounds is

$\displaystyle R_{i,t}=\sum_{\tau}^tr_{i,\tau}.$

So the regret (of the forecaster or learner) to an action is the difference between the forecaster’s cumulative loss and the cumulative loss of that action.

3.2. Normal-Hedge

Normal-Hedge algorithm is based on a potential function

$\displaystyle \phi(x,c)=\exp{(\frac{([x]_+)^2}{2c})}, x\in{\mathbb R}, c>0$

where ${[x]_+}$ denotes ${\max\{0,x\}}$. The algorithm keeps tracking the cumulative regrets ${R_{i,t}}$ to each action ${i}$ after each round ${t}$, the algorithm also maintains a scale parameter ${c_t}$, which is chosen so that the average of the potential remains constant at ${e}$:

$\displaystyle \frac{1}{N}\sum_{i=1}{N}\exp{(\frac{([R_{i,t}]_+)^2}{2c_t})} = e.$

${c_t}$ can be determined with a line search.

The weight assigned to ${i}$ in round ${t}$ is set proportional to the first-derivative of the potential, evaluated at ${R_{i,t-1}}$ and ${c_{t-1}}$:

$\displaystyle p_{i,t} \propto \frac{\partial}{\partial x}\phi(x,c)|_{x=R_{i,t-1},c=c_{t-1}} = \frac{[R_{i,t-1}]_+}{c_{t-1}}\exp(\frac{([R_{i,t-1}]_+)^2}{2c_{t-1}}).$

The full algorithm is shown in below:

Initially: Set ${R_{i,0} = 0, p_{i,1} = 1/N}$ for each ${i}$.

For ${t=1,2,\cdots}$

1. Each action ${i}$ incurs loss ${l_{i,t}}$.
2. Forcaster incurs loss ${l_{A,t}=\sum_{i=1}^Np_{i,t}l_{i,t}}$.
3. Update cumulative regrets: ${R_{i,t}=R_{i,t-1}+(l_{A,t}-l_{i,t})}$ for each ${i}$.
4. Find ${c_t>0}$ satisfying ${\frac{1}{N}\sum_{i=1}^N\exp(\frac{([R_{i,t}])^2}{2c_t})=e}$.
5. Update distribution for round ${t+1: p_{i,t+1}\propto \frac{[R_{i,t-1}]_+}{c_{t-1}}\exp(\frac{([R_{i,t-1}]_+)^2}{2c_{t-1}})}$ for each ${i}$.

Notice that the actions for which ${R_{i,t-1}\leq0}$ receive zero weight in round ${t}$, which means non-zero weights are assigned only to actions which performs better than the algorithm. This is the key factor in this algorithm.

The analysis of this Normal-Hedge can be summerized as the following theorem.

Theorem 4 If Normal-Hedge has access to ${N}$ actions, then, as ${t\rightarrow \infty}$, the regret of Normal-Hedge to the top ${\epsilon}$-quantile of actions approaches an upper bound of

$\displaystyle \sqrt{3t(1+\ln{\frac{1}{\epsilon}})+o(t)}.$

In praticular, the regret of Normal-Hedge to the best action approaches an upper bound of

$\displaystyle \sqrt{3t(1+\ln{N})+o(t)}.$

Theorem 4 shows that the leading term of the regret bound with repect to ${\epsilon}$-quantile is of the order of ${\sqrt{n\ln{\frac{1}{\epsilon}}}}$ (or ${\sqrt{T\ln{\frac{1}{\epsilon}}}}$).

The proof of theorem 4 is supported by several lemmas. The general idea is as follows. First we try to relate the performance of the algorithm at time ${t}$ to the scale ${c_t}$. The result is shown in lemma 5 below.

Then we bound the growth of the scale ${c_t}$ as a function of the time ${t}$. As ${c_t}$ increases monotonically with ${t}$, we can divide the rounds ${t}$ into two phases, ${t and ${t\geq t_0}$, where ${t_0}$ is choosen very carefully. The proof then shows bounds on the growth of ${c_t}$ for each phase separately. It can be shown that ${c_t}$ is not too large at the end of the fisrt phase. The per-round growth of ${c_t}$ can be bounded in the second phase, so that it doesn’t increase too fast. The details of the proof is omitted here but can be found here[6].

Lemma 5 At any time ${t}$, the regret to the best action can bounded as:

$\displaystyle \max_iR_{i,t} \leq \sqrt{2c_t(\ln{N}+1)}.$

Moreover, for any ${0\leq\epsilon\leq1}$ and any ${t}$, the regret to the top ${\epsilon}$-quantile of actions is at most

$\displaystyle \sqrt{2c_t(\ln{\frac{1}{\epsilon}}+1)}.$

4. Open problem

Finally we are here. The open problem is that whether there is an algorithm whose regret bound has a leading term of the form ${O(\sqrt{V_n\ln{\frac{1}{\epsilon}}})}$ (possibly with lower order logarithmic terms).

One can interpret Hedge as a potential-based algorithm, in which the potential assigned by Hedge to action ${i}$ is proportional to ${\exp(\eta R_{i,t})}$. It’s clear that Normal-Hedge uses a quite different potential function. One common thing about bound in Normal-Hedge and the bound parameterized using variance is that they all depend on the forecaster’s past predictions. One conjecture is that there is a hedging algorithm for this open problem.

References

1. Freund, Y. (2016). Open Problem: Second order regret bounds parametriezed by variance across actions and top ${\epsilon}$ percentile. JMLR:Workshop and Conference Proceedings vol 49:1-4.
2. Littlestone, N., and Warmuth, M.K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
3. Allenberg-Neeman, C., and Neeman B. (2004). Full information game with gains and losses.Algorithmic Learning Theory, 15th International Conference, ALT 2004, Padova, Italy, October 2004, InProceedings, volume 3244 of Lecture Notes in Artiﬁcial Intelligence, pp. 264-278, Springer.
4. Cesa-Bianchi N., Mansour Y., Stoltz G. (2006). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2):321-352.
5. Auer, P., Cesa-Bianchi, N., and Gentile, C. (2002). Adaptive and self-conﬁdent on-line learning algorithms. Journal of Computer and System Sciences, 64, 48–75.
6. Chaudhuri K., Freund Y., and Hsu D.J. (2009). A parameter-free hedging algorithm. CoRR, abs/0903.2851.
7. Chernov A.V. and Vovk V. (2010). Prediction with advice of unknown number of experts. CoRR, abs/1006.0475.
8. Koolen W.M. and Erven T. (2015). Second-order quantile methods for experts and combinatorial games. CoRR, abs/1502.08009.
9. Luo H. and Schapire. (2015). Achieving all with no parameters: Adaptive normalhedge. CoRR, abs/1502.05934.