**Second order regret bounds parametrized by variance across actions and top percentile**

This blog will describe an open problem posted by Yoav Freund[1]. We’ll introduce the problem setting and an overview of regret bounds in section 1. In section 2, we’ll define variance across actions and how it comes into play. Then we’ll introduce percentile in section 3. Section 4 will put forward the open problem.

**1. Introduction **

This study focuses on extrenal regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action.

** 1.1. Problem setting **

The forecasting game concerned here is played in rounds. At each time step , the forecaster computes an assignment

which are probabilities over the actions. Then the payoff vector for time is revealed:

The forecaster’s reward is

We define the cumulative reward of the forecaster by

and the cumulative payoff of action by

For all , let be the cumulative payoff of the best action up to time . The goal is to keep the as small as possible uniformly over .

** 1.2. An overview of regret bounds **

There are several kinds of regret bounds with repect to the forecasting games as follows:

**Zero-order bounds:** we say that a bound is of order zero whenever it only depends on bounds on the payoffs (or on the payoff ranges) and on the number of time step . The basic version of the exponentially weighted average forecaster of Littlestone and Warmuth[2] ensures the order of magnitude of the regret is

where is the bound on the payoffs:

Actually, the factor can be replaced by a bound on the *effective ranges* of the payoffs, defined by

To reach this basic version of the regret bound, it assumes that we have prior knowledge of both and (or ), so as to tune the learning rate.

**First-order bounds:** a bound is first order whenever its main term depends on a sum of payoffs. We can define

which is the sum of the obsolute values of the payoffs of the best expert for rounds . Note that the payoffs could be positive or negative. Allenberg-Neeman[3] gave a simple algorithm whose regret is of the order of

Since the , first-order bounds are usually sharper than zero-order bounds.

**Second-order bounds:** a regret bound is second-order whenever its main term is a function of a sum of squared payoffs (or on a quantity that is homogeneous in such a sum). Ideally, they are a function of

Cesa-Bianchi[4] gave a simple algorithm whose regret is of the order of

Since , this bound improves the first-order bounds.

**2. Variance across actions **

If the payoffs are uniformly bounded, and there are infinitely many actions, then there exists simple forecasting strategies whose external regret per time step vanishes irrespective to the choice of the payoff sequence. However, if the payoffs were generated by an independent stochastic process, the tightest rate for the regret with respect to a fixed action should depend on *variance* (rather than the average) of the observed payoffs for that action. Cesa-Bianchi[4] provided an algorithm that makes a step towards it.

**Definitions:** Let be the random variable with range and distribution . Introduce

Hense is the variance of the payoffs at time under the distribution and the cumulative variance is defined as

which is the main second-order quantity used later.

The regret bounds parameterized by variance is derived for the weighted majority forecaster of Littlestone and Warmuth[1] using a time-varing learning rate. Because of the way the learning rate is chosen, this bound actually involves the way the algorithm predicts.

The weighted majority forecaster works as follows: it uses the sequence of learning rates assigns at time a probability distribution over the actions defined by and

Note that the quantities may depend on the past payoffs.

Let’s first introduce a potential function

The following lemma bounds the potential function in terms of variance.

Lemma 1For all payoff vectors , all probability distributions , and all learning rates , we have

where is such that . If, in addition, , then

*Proof:* To prove the first inequality, we have

To prove the second one, we use for .

The last step uses inequality .

The following lemma is used to prove the main theorem later. The proof of this lemma is omitted but can be found in [4,5].

Lemma 2Consider any nonincreasing sequence of positive learning rates and any sequence of payoff vectors. Then the weighted majority forecaster run with the sequence satisfies, for any and for any ,

We now introduce time-varing learning rate based on . The sequence is defined as

for , with . Note that depends on the forecaster’s past predictions.

Theorem 3Provide a bound on the payoff ranges is known beforehand, i.e., , the weighted majority forecaster using the time-varing learning rate achieves, for all sequences of payoffs and for all ,

*Proof:* By using Lemma 1 and 2, and setting for the analysis, we have

We now try to bound the second term. Suppose the round that the first time step t when . Note that

so

Also we have and for . It’ll give us for . Since and , we have

So we get

It’s easy to see that when , which means , the regret is bouned by

When , the regret is bounded by

Putting things together, it concludes the proof.

Theorem 3 tells us that the leading term of regret can be bounded as the order of . is the second-order quantity associated to the sequence of payoffs. However, the precise defition of these quantities makes the bounds generally not comparable to the bounds shown in section 1. One would prefer the bound in this section in case of randomized prediction, in which the forecaster draws an action at random according to at every round (), since the bounds in section 1 are more explict in terms of the payoff sequence as they do not involve the way the algorithm predicts.

**3. Top percentile **

Top -quantile was first introduced by Chaudhuri, Freund and Hsu[6] and has been studied by several groups[7,8,9]. It was used to solve the issue with using online-learning in problems when the number of actions() is very large.

In this section, we are considering the cumulative losses instead of payoffs. When the number of actions is very large, the regret to the *best action* is not an effective measure of performance. In some situations, one expects to have a lot of actions that are close to the action with the lowest loss or highest payoff. As these actions also have low losses, measuring performance with respect to a small group of actions that perform well is extremly reasonable.

Here is a notion of -quantile, which is more natural for practical applications. We order the cumulative losses of all actions from lowest to highest and define the *regret* of the forecaster to the top -quantile to be the difference between cumulative loss of the forecaster and the -th element in the sorted list Chaudhuri, Freund and Hsu[6] provided an algorithm, so called Normal-Hedge, which gives the regret to the top -quantile of actions at most

It holds for all and . If we set , we get that the regret to the best action is upper-bounded by .

** 3.1. Setting **

We assume that the losses lie in an interval of length 1. In round , each action incurs a loss , and the forecaster incurs the expected loss:

The forecaster’s instantaneous regret to an action in round is , and the cumulative regret to an action in the first rounds is

So the regret (of the forecaster or learner) to an action is the difference between the forecaster’s cumulative loss and the cumulative loss of that action.

** 3.2. Normal-Hedge **

Normal-Hedge algorithm is based on a potential function

where denotes . The algorithm keeps tracking the cumulative regrets to each action after each round , the algorithm also maintains a scale parameter , which is chosen so that the average of the potential remains constant at :

can be determined with a line search.

The weight assigned to in round is set proportional to the first-derivative of the potential, evaluated at and :

The full algorithm is shown in below:

Initially: Set for each .

For

- Each action incurs loss .
- Forcaster incurs loss .
- Update cumulative regrets: for each .
- Find satisfying .
- Update distribution for round for each .

Notice that the actions for which receive zero weight in round , which means non-zero weights are assigned only to actions which performs better than the algorithm. This is the key factor in this algorithm.

The analysis of this Normal-Hedge can be summerized as the following theorem.

Theorem 4If Normal-Hedge has access to actions, then, as , the regret of Normal-Hedge to the top -quantile of actions approaches an upper bound of

In praticular, the regret of Normal-Hedge to the best action approaches an upper bound of

Theorem 4 shows that the leading term of the regret bound with repect to -quantile is of the order of (or ).

The proof of theorem 4 is supported by several lemmas. The general idea is as follows. First we try to relate the performance of the algorithm at time to the scale . The result is shown in lemma 5 below.

Then we bound the growth of the scale as a function of the time . As increases monotonically with , we can divide the rounds into two phases, and , where is choosen very carefully. The proof then shows bounds on the growth of for each phase separately. It can be shown that is not too large at the end of the fisrt phase. The per-round growth of can be bounded in the second phase, so that it doesn’t increase too fast. The details of the proof is omitted here but can be found here[6].

Lemma 5At any time , the regret to the best action can bounded as:

Moreover, for any and any , the regret to the top -quantile of actions is at most

**4. Open problem **

Finally we are here. The open problem is that whether there is an algorithm whose regret bound has a leading term of the form (possibly with lower order logarithmic terms).

One can interpret Hedge as a potential-based algorithm, in which the potential assigned by Hedge to action is proportional to . It’s clear that Normal-Hedge uses a quite different potential function. One common thing about bound in Normal-Hedge and the bound parameterized using variance is that they all depend on the forecaster’s past predictions. One conjecture is that there is a hedging algorithm for this open problem.

** References **

- Freund, Y. (2016). Open Problem: Second order regret bounds parametriezed by variance across actions and top percentile.
*JMLR:Workshop and Conference Proceedings*vol 49:1-4. - Littlestone, N., and Warmuth, M.K. (1994). The weighted majority algorithm.
*Information and Computation*, 108, 212–261. - Allenberg-Neeman, C., and Neeman B. (2004). Full information game with gains and losses.
*Algorithmic Learning Theory, 15th International Conference, ALT 2004, Padova, Italy, October 2004, InProceedings, volume 3244 of Lecture Notes in Artiﬁcial Intelligence*, pp. 264-278, Springer. - Cesa-Bianchi N., Mansour Y., Stoltz G. (2006). Improved second-order bounds for prediction with expert advice.
*Machine Learning*, 66(2):321-352. - Auer, P., Cesa-Bianchi, N., and Gentile, C. (2002). Adaptive and self-conﬁdent on-line learning algorithms.
*Journal of Computer and System Sciences*, 64, 48–75. - Chaudhuri K., Freund Y., and Hsu D.J. (2009). A parameter-free hedging algorithm.
*CoRR*, abs/0903.2851. - Chernov A.V. and Vovk V. (2010). Prediction with advice of unknown number of experts.
*CoRR*, abs/1006.0475. - Koolen W.M. and Erven T. (2015). Second-order quantile methods for experts and combinatorial games.
*CoRR*, abs/1502.08009. - Luo H. and Schapire. (2015). Achieving all with no parameters: Adaptive normalhedge.
*CoRR*, abs/1502.05934.

good!

LikeLiked by 1 person