Lecture #17: Multiplicative Weights, discussion

1. Adversaries in Multiplicative Weights

Here’s some discussion about the issue of what the adversary sees before it presents the cost vector. To begin, let’s remove the step where the algorithm actually makes a “prediction”. Instead, the algorithm only maintains a current weight vector {\overline{w}^{(t)}}. And for a cost vector {\overline{m}^{(t)}} presented by the adversary, define the algorithm’s cost to be

\displaystyle  \overline{p}^{(t)} \cdot \overline{m}^{(t)},

where

\displaystyle  {p}_i^{(t)} = \frac{{w}_i^{(t)}}{\sum_{j} {w}_j^{(t)}} = \frac{{w}_i^{(t)}}{\Phi^{(t)}}.

There is no longer any randomization in the algorithm; it is completely deterministic.
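To make the deterministic version concrete, here is a minimal sketch of one round in Python. The update rule {w_i \leftarrow w_i e^{-\varepsilon m_i}} is an assumption read off from the weight vectors in Section 2 below (the rule used in lecture may be stated differently), and the name `mw_step` is made up for illustration.

```python
import math

def mw_step(weights, costs, eps):
    """One deterministic MW round: return (cost p.m, updated weights)."""
    phi = sum(weights)                      # potential Phi^(t)
    probs = [w / phi for w in weights]      # p_i = w_i / Phi
    round_cost = sum(p * m for p, m in zip(probs, costs))
    # Assumed update rule: w_i <- w_i * exp(-eps * m_i)
    new_weights = [w * math.exp(-eps * m) for w, m in zip(weights, costs)]
    return round_cost, new_weights

weights = [1.0, 1.0, 1.0, 1.0]
cost, weights = mw_step(weights, [1, 0, 0, 0], eps=0.1)
```

No coin is flipped anywhere: given the same cost vectors, the weights and the costs come out the same every time.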

Now, what is the adversary allowed to base the cost vector {\overline{m}^{(t)}} on? We allow the adversary to base {\overline{m}^{(t)}} on the entire history; it can depend on {\overline{w}^{(1)}, \overline{w}^{(2)}, \ldots, \overline{w}^{(t)}}, and in particular, the current weights maintained by the algorithm. But we claim that knowing this information does not give the adversary any additional power, since each of these vectors is a deterministic function of the previous cost vectors {\overline{m}^{(1)}, \overline{m}^{(2)}, \ldots, \overline{m}^{(t-1)}} presented by the adversary.

To examine the argument a bit more, consider the following adversaries:

  1. (Adversary Type 1) The adversary writes down a sequence of {T} cost vectors {\overline{m}^{(1)}, \overline{m}^{(2)}, \ldots, \overline{m}^{(T)}} up-front, and we run the (deterministic) MW algorithm on it. For each such sequence of cost vectors, and for each expert {i}, we have the guarantee that

    \displaystyle    \sum_t \overline{p}^{(t)} \cdot \overline{m}^{(t)} \leq \sum_t {m}_i^{(t)} + \text{regret}(T,\varepsilon,N).    \ \ \ \ \ (1)


    where I’ve bundled all the additive terms into this “regret” term.

  2. (Adversary Type 2) For each {t}, the adversary is “adaptive” but still deterministic: the new cost vector {\overline{m}^{(t)}} it gives is a deterministic function of the entire history so far. As we said, this entire history is just a (deterministic) function of {\overline{m}^{(1)}, \overline{m}^{(2)}, \ldots, \overline{m}^{(t-1)}}. So for any fixed adversary (which is completely determined by what this map from the history to the next cost vector is), the sequence of cost vectors it gives is some fixed sequence. So if we have a guarantee for each sequence (as above), we get that for any adversary and any expert {i}, inequality (1) holds for the MW algorithm.

    So a Type 2 adversary is no more powerful than that of Type 1.

  3. (Adversary Type 3) For each {t}, the adversary is “adaptive” and also randomized: the new cost vector {\overline{m}^{(t)}} it gives at time {t} could be a random variable, and also depend on the entire history so far (which itself is a random variable). Note that the algorithm is still deterministic: it flips no coins. Given such an adversary, we could look at the probability space of all length-{T} sequences of cost vectors {\overline{m}^{(1)}, \overline{m}^{(2)}, \ldots, \overline{m}^{(T)}} with the probability of any such length-{T} sequence being the probability it is generated by running MW against this adversary. But inequality (1) guarantees that for each length-{T} sequence of cost vectors, we are not much worse than the best expert {i}. Taking expectations over this probability space, we get

    \displaystyle    \sum_t \mathbb{E} [\overline{p}^{(t)} \cdot \overline{m}^{(t)}] \leq \sum_t \mathbb{E}[   {m}_i^{(t)}] + \text{regret}(T,\varepsilon,N).    \ \ \ \ \ (2)


    where the expectation is taken over the randomness of the adversary.

    The contrapositive of this statement says: if we have an adversary where the expected regret is high, there must be a fixed length-{T} sequence of cost vectors where the regret is high. So, even an adversary of Type 3 is no more powerful than one of Type 1.

So you can indeed think of the adversary as choosing the cost vector {\overline{m}^{(t)}} depending on the entire history. Or as just writing down a {T}-length sequence in advance. It’s the same.
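The Type 2 to Type 1 reduction can be checked by replaying: run MW against an adaptive adversary, record the cost vectors it produced, and feed the same list back in as a sequence fixed up-front. This is only a sketch; `run_mw` and `adaptive` are illustrative names, and the exponential update {w_i \leftarrow w_i e^{-\varepsilon m_i}} is assumed.

```python
import math

def run_mw(n, T, eps, adversary):
    """Run deterministic MW; adversary(t, weights) returns the round-t cost vector."""
    weights = [1.0] * n
    total, history = 0.0, []
    for t in range(T):
        m = adversary(t, weights)           # adversary may inspect the weights
        phi = sum(weights)
        total += sum(w / phi * c for w, c in zip(weights, m))
        weights = [w * math.exp(-eps * c) for w, c in zip(weights, m)]
        history.append(m)
    return total, history

def adaptive(t, weights):
    # Type 2: charge the currently heaviest expert (lowest index on ties).
    j = weights.index(max(weights))
    return [1 if i == j else 0 for i in range(len(weights))]

adaptive_cost, seq = run_mw(n=5, T=40, eps=0.2, adversary=adaptive)
# Type 1: the same sequence, written down in advance and replayed blindly.
oblivious_cost, _ = run_mw(n=5, T=40, eps=0.2, adversary=lambda t, w: seq[t])
```

The two runs perform the same arithmetic in the same order, so `adaptive_cost` and `oblivious_cost` agree exactly.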

1.1. Predictions

Finally, what about the fact that the MW algorithm (as specified in lecture) was also making random predictions? The fact that the future decisions of the algorithm did not depend on these random predictions, and that the adversary does not see the prediction before it creates the cost vector, allows us to push the same argument through.

The easiest way to argue this may be to imagine that it’s not the algorithm that makes the predictions. The algorithm instead gives the vector {\overline{p}^{(t)}} to the adversary, who generates a random sample from the distribution himself. And to an external observer, the distributions of the predictions remain exactly the same, and so does their expected cost. But internally, we’ve just reduced to the case of the Type 3 randomized adversary, which we just argued about.
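As a small sanity check of the claim about expected cost, the sketch below draws the prediction from {\overline{p}^{(t)}} many times and compares the average cost with {\overline{p}^{(t)} \cdot \overline{m}^{(t)}}. The particular vectors, the seed, and the number of trials are made-up values for illustration only.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
probs = [0.5, 0.3, 0.2]         # hypothetical p^(t) handed over by the algorithm
costs = [1.0, 0.0, 0.5]         # hypothetical m^(t), fixed before the draw

# Deterministic cost p.m from the prediction-free version of the algorithm.
exact = sum(p * m for p, m in zip(probs, costs))

# Whoever flips the coin, the prediction is a sample from probs.
trials = 200_000
draws = rng.choices(range(len(probs)), weights=probs, k=trials)
empirical = sum(costs[i] for i in draws) / trials
```

The code is the same whether we imagine the algorithm or the adversary calling `rng.choices`, which is the point of the argument above.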

2. John’s Example

John asked a good (and illustrative) question: what if the adversary looks at the current {\overline{p}^{(t)}}, chooses the index {j^*} which has maximum {{p}_j^{(t)}} (say the lowest numbered index, if this is not unique), and defines {\overline{m}^{(t)}} with {m_{j^*}^{(t)} = 1} and {m_j^{(t)} = 0} for {j \neq j^*}? In this setting, we pay at least {1/N} at every step!

So the {T}-step cost would be {T/N}, which may be much more than {\varepsilon T + \frac{\ln N}{\varepsilon}} if {\varepsilon \ll 1/N}. What’s happening here?

The saving grace is that even the best expert cannot have low cost: MW will ensure that all the experts will end up paying a non-trivial amount. Indeed, how does the vector {\overline{w}^{(t)}} evolve in this case? It starts off at the vector

\displaystyle  (1, 1,1, \cdots 1).

Then it moves to

\displaystyle  (e^{-\varepsilon}, 1,1, \cdots 1).

And then to

\displaystyle  (e^{-\varepsilon}, e^{-\varepsilon},1, \cdots 1).

After {N} steps, we are at

\displaystyle  (e^{-\varepsilon}, e^{-\varepsilon},e^{-\varepsilon}, \cdots   e^{-\varepsilon}).

And so to

\displaystyle  (e^{-2\varepsilon}, e^{-\varepsilon},e^{-\varepsilon}, \cdots e^{-\varepsilon}).

But you see the pattern: any fixed expert {i} incurs a cost of {1} every {N} steps. Which means after {T} steps, each expert has incurred cost in {\{ \lfloor T/N \rfloor, \lceil T/N \rceil \}}.
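The round-robin pattern is easy to confirm by simulation. The following is a sketch with arbitrarily chosen {N}, {T} and {\varepsilon}; `johns_adversary` is an illustrative name, and picking the heaviest weight is the same as picking the largest {{p}_j^{(t)}}, since the two differ only by the common factor {\Phi^{(t)}}.

```python
import math

def johns_adversary(weights):
    """Cost 1 for the heaviest expert (lowest index on ties), 0 elsewhere."""
    j_star = weights.index(max(weights))
    return [1 if j == j_star else 0 for j in range(len(weights))]

N, T, eps = 5, 23, 0.1          # arbitrary illustrative values
weights = [1.0] * N
expert_costs = [0] * N
for t in range(T):
    m = johns_adversary(weights)
    expert_costs = [c + x for c, x in zip(expert_costs, m)]
    # Weight evolution as in the vectors above: only the charged entry shrinks.
    weights = [w * math.exp(-eps * x) for w, x in zip(weights, m)]

floor_cost, ceil_cost = T // N, -(-T // N)
```

Every entry of `expert_costs` ends up equal to `floor_cost` or `ceil_cost`.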

And what is our cost? Even if the weight vector {\overline{w}^{(t)}} were as lopsided as

\displaystyle  (e^{-\varepsilon}, e^{-\varepsilon}, e^{-\varepsilon}, \cdots, e^{-\varepsilon}, 1),

the expected cost using the cost vector {(0,0,\cdots, 0, 1)} is at most (using {e^{\varepsilon} \leq 1 + \varepsilon + \varepsilon^2} for {\varepsilon \leq 1})

\displaystyle  \frac{1}{1 + (N-1)e^{-\varepsilon}} = \frac{e^\varepsilon}{e^{\varepsilon} + (N-1)} \leq \frac{ 1 + \varepsilon + \varepsilon^2}{N}.

So our {T}-step total cost is at most

\displaystyle  \frac{ 1 + \varepsilon + \varepsilon^2}{N} \times T \leq \frac{T}{N} + \frac{2\varepsilon}{N} \cdot T.

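Assuming that {N \geq 2} (else it’s trivial), and using the fact that every expert {i} has incurred cost at least {\lfloor T/N \rfloor \geq T/N - 1}, this gives

\displaystyle  \text{MW's cost} \leq \sum_t {m}_i^{(t)} + \varepsilon T + 1.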

So all is OK.
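The per-step and total bounds above can also be checked numerically. This is a sketch under the same assumptions as the example (exponential update, adversary charging the heaviest expert), with arbitrarily chosen parameters.

```python
import math

N, T, eps = 5, 1000, 0.1        # arbitrary illustrative values, eps <= 1
weights = [1.0] * N
per_step_bound = (1 + eps + eps**2) / N
total, worst_step = 0.0, 0.0
for t in range(T):
    phi = sum(weights)
    j_star = weights.index(max(weights))    # heaviest expert, lowest index on ties
    step_cost = weights[j_star] / phi       # p . m when m is 1 only at j_star
    total += step_cost
    worst_step = max(worst_step, step_cost)
    weights[j_star] *= math.exp(-eps)       # only the charged expert's weight shrinks

total_bound = T / N + 2 * eps * T / N
```

Here `worst_step` stays below `per_step_bound`, and `total` lands between {T/N} and `total_bound`.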

2.1. Another Example

But what if there were some hidden position {j'} in which the adversary never puts a {1}? Instead he chooses the position {j \neq j'} with maximum {{p}_j^{(t)}}. In this case the best expert (namely {j'}) has total cost {0}. So it had better be the case that our cost is at most {\varepsilon T + (\ln N)/\varepsilon}. It’s a simple exercise to see that this is indeed the case.
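A simulation is no substitute for doing the exercise, but it does show the behavior. In the sketch below the parameters and the index `hidden` (standing in for {j'}) are arbitrary choices, and the exponential update {w_i \leftarrow w_i e^{-\varepsilon m_i}} is assumed.

```python
import math

N, T, eps, hidden = 5, 2000, 0.1, 2     # `hidden` plays the role of j'
weights = [1.0] * N
expert_costs = [0] * N
total = 0.0
for t in range(T):
    phi = sum(weights)
    # Heaviest expert other than the hidden one (lowest index on ties).
    j = max((i for i in range(N) if i != hidden), key=lambda i: weights[i])
    total += weights[j] / phi           # our cost p_j for this round
    expert_costs[j] += 1
    weights[j] *= math.exp(-eps)        # the hidden expert's weight never shrinks

bound = eps * T + math.log(N) / eps
```

The hidden expert’s weight stays at {1} while all the others decay, so almost all of the probability mass moves to {j'} and our per-step cost shrinks toward {0}; `total` ends up well under `bound`.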
