Check out the comments in this post for fixes. In short, they involve question 1 (the should be ), and question 5b (the expression should be ).
The course evaluations are now open off this page.
I realize they take a little time, but they are super-useful: given the course is a brand-new one, we could really use your feedback to see what we did right and what we didn’t, and to change things around as needed.
For those of you who weren’t in class on Tuesday: Homework #6 is on the webpage. The number of problems you must solve is based on the number of times you scribe for the course.
A draft of the Multiplicative Weights lecture notes has been posted.
is Tuesday Nov 15th. (Not Thursday Nov 15th, as listed earlier.)
1. Adversaries in Multiplicative Weights
Here’s some discussion about the issue of what the adverasary sees before it presents the cost vector. To begin, let’s remove the step where the algorithm actually makes a “prediction”. Instead, the algorithm only maintains a current weight vector . And for a cost vector presented by the adversary, define the algorithm’s cost to be
There is no longer any randomization in the algorithm; it is completely deterministic.
Now, what is the adversary allowed to base the cost vector on? We allow the adversary to base on the entire history; it can depend on , and in particular, the current weights maintained by the algorithm. But we claim that knowing this information does not give the adversary any additional power, since each of these vectors are deterministic functions of the previous cost functions presented by the adversary.
To examine the argument a bit more, consider the following adversaries:
- (Adversary Type 1) The adversary writes down a sequence of cost vectors up-front, and we run the (deterministic) MW algorithm on it. For each such sequence of cost vectors, and for each expert , we have the guarantee that
where I’ve bundled all the additive terms into this “regret” term.
- (Adversary Type 2) For each , the adversary is “adaptive” but still deterministic: the new cost vector it gives is a deterministic function of the entire history so far. As we said, this entire history is just a (deterministic) function of . So for any fixed adversary (which is completely determined by what this map from the history to the next cost vector is), the sequence of cost vectors it gives is some fixed sequence. So if we have a guarantee for each sequence (as above), we get that for any adversary and any expert , inequality~(2) holds for the MW algorithm.
So a Type 2 adversary is no more powerful than that of Type 1.
- (Adversary Type 3) For each , the adversary is “adaptive” and also randomized: the new cost vector it gives at time could be a random variable, and also depend on the entire history so far (which itself is a random variable). Note that the algorithm is still deterministic: it flips no coins. Given such an adversary, we could look at the probability space of all length- sequences of cost vectors with the probability of any such length- sequence being the probability it is generated by running MW against this adversary. But inequality~(2) guarantees that for each length- sequence of cost vectors, we are not much worse than the best expert . Hence we get
where the expectations is taken over the randomness of the adversary.
The contrapositive of this statement says: if we have an adversary where the expected regret is high, there must be a fixed length- sequence of cost vectors where the regret is high. So, even an adversary of Type 3 is no more powerful than one of Type 1.
So you can indeed think of the adversary as choosing the cost vector depending on the entire history. Or as just writing down a -length sequence in advance. It’s the same.
Finally, what about the fact that the MW algorithm (as specified in lecture) was also making random predictions? The fact that the future decisions of the algorithm did not depend on these random predictions, and that the adversary does not see the prediction before it creates the cost vector, allows us to push the same argument through.
The easiest way to argue this may be to imagine that it’s not the algorithm that makes the predictions. The algorithm instead gives the vector to the adversary, who generates a random sample from the distribution himself. And to an external observer, the distributions of the predictions remain exactly the same, and so does their expected cost. But internally, we’ve just reduced to the case of the Type 3 randomized adversary, which we just argued about.
2. John’s Example
John asked a good (and illustrative) question: what about the adversary looks at the current , chooses the index which has maximum (say the lowest numbered index, if this is not unique), and defines with and for . Hence, for this setting, we pay at least at every step!
So the -step cost would be , which may be much more than if . What’s happening here?
The saving grace is that even the best expert cannot have low cost: MW will ensure that all the experts will end up paying a non-trivial amount. Indeed, how does the vector evolve in this case? Starts off at the vector
Then it moves to
And then to
After steps, we are at
And so to
But you see the pattern: any fixed expert incurs a cost of every steps. Which means after steps, each expert has incurred cost in .
And what is our cost? Even if the weight vector were as lopsided as
the expected cost using the cost vector is approximately (for )
So our -step total cost is at most
Assuming that (else it’s trivial), this gives
So all is OK.
2.1. Another Example
But what if there were some hidden that the adversary never gives a in this position ? Instead he chooses the position with maximum . In this case the best expert (namely ) has total cost . So it better be the case that our cost is at most . It’s a simple exercise to see it is indeed the case.
Lecture notes on the Canonical SDP for CSPs are published.
Here is a draft of the lecture notes on The Canonical LP for Constraint Satisfaction Problems.
Homework 3 has been graded. The average was 16. Some comments (the detailed solutions are on the webpage, with the solution to problem 1 to be posted tomorrow).
1. Most of you got this one right. The reason why the binary search solution was not strongly poly-time was that the number of steps depends on the size of the numbers involved.
2. For this one, a few solutions tried proving it from first principles. All these statements can be proved in that way, but the easier way is to prove one version of the Farkas lemma (which you’ve already done in HW2) and then reduce these other versions to that one.
3. No surprises here, nothing to report.
4. Perhaps a majority of the solutions reduced it to standard matching and used Hall/Konig’s theorem to prove this statement. Perfectly kosher, even though I was hoping for the duality-based solution.
We give back the graded HWs 2 and 3 at the end of lecture tomorrow.