I just completed an assignment for my machine learning class (which, in case you haven’t picked up from my lack of posting frequency, is a little challenging), for which I had a write a program that implements the concept of boosting, specifically AdaBoost.
For those like me who balk initially at the sight of mathematical formulas, here’s the up-shot of how boosting works.
Let’s say you’re getting ready for a big date: you met an awesome young woman and are nervous about your first evening out together. You’re a dude, so you inherently (or at least, probably) suck at picking out decent outfits. So you bring in the big guns: a veritable “panel” of your closest friends. Surely, all of them put together can hash out something reasonable?
Except, common sense should already tell you that some of your friends will be better at telling you which outfits are “better” than others. Problem is, you have no idea which ones. And on top of that, some of your friends may “lean” towards one style or another; maybe your guy friends are really good at picking out casual wear for you, but your lady friends are much more proficient at getting you dressed up (as an example).
So you need a way of, essentially, “weighting” the contribution of each of your friends’ input on an outfit, such that their collective decisions will yield something as close as possible to the true, objective “awesomeness” of your outfit. How should you do this?
- Before bringing in your friends, pick out a few outfits that you know beyond any doubt are either AWFUL, or BRILLIANT (maybe your mom told you, or something).
- Bring in your 10 friends.
- One at a time, put on the outfits and have your friends weigh in.
- For outfit #1, each contribution by your friends is weighted equally. Thus, each friend’s input constitutes 10% of the “total” answer. If 9 friends vote “yay” and 1 “nay”, you can be 90% sure it’s a solid outfit. And vice versa.
- However, for that first outfit, you already know ahead of time whether it’s “yay” or “nay”. Thus, you know which of your friends are wrong. For those who are wrong, decrease their overall contributory weight in such a way that, if you present a similar outfit in the future, you’ll decrease their contribution only for similar outfits.
- Repeat until you’ve tried on all the outfits you set aside.
Now, you have a bunch of weights assigned to each of your friends. When you try on various outfits for which you have no idea, you can be fairly confident (with a level of confidence directly proportional to how many outfits you tried on in the steps above) that the outfit you eventually settle on will, in fact, impress your lady friend (or at least guarantee that your Hawaiian shirt and Uggs aren’t what she remembers most clearly about the evening).
That, in essence, is boosting. That’s what I had to do. Unfortunately, I had a small bug somewhere in my code…
def iterate(data, labels, weights): ''' Performs a single iteration of boosting, using the given data, its corresponding labels, and the weights of each data point. ''' # set up all the variables we'll need for this operation L = A = D = J = 0 # loop through all the features - O(p) for t in range(0, data.shape): # extract the feature feature = data[..., t] # sort the features by sorting their keys - O(n log n) sorted = np.argsort(feature) # if this is the first feature, we need to count up the degree of separation # count the positive and negative labels posAbove = posBelow = negAbove = negBelow = 0 for index in sorted: # - O(n) if labels[index] > 0: posBelow += weights[index] else: negBelow += weights[index] # for this current data point, check on its label and update # the mixture of labels above and below to judge how homogenous # the separation is based on splitting by this single point n = len(feature) for i in range(0, n): # - O(n) # split on cutoff and calculate the error index = sorted[i] cutoff = feature[index] # move this current point's label to the Above pool, since the # cutoff is of the form "less than or equal to" if labels[index] < 0: negBelow -= weights[index] negAbove += weights[index] else: posBelow -= weights[index] posAbove += weights[index] # now: how many errors would splitting on "cutoff" make? # first let's assign the top to be +, bottom to be - # (corresponds to d = 1) d = 1 accuracy = posAbove + negBelow error = negAbove + posBelow # was this a good cutoff? if error > accuracy: temp = error error = accuracy accuracy = temp d = -1 # is our current accuracy better than our best? if L < accuracy: L = accuracy A = cutoff D = d J = t # return the index for which we had the best accuracy return (A, D, J, error)
Yeahhh, and this is just a chunk (albeit the majority) of my code. Specifically, it’s the part that, given your current weighting system, and given that you’re trying on the outfits for which you already know whether they’re good or bad, figures out the most accurate assignment of people to outfits.
My problem was this: let’s say I had 250 friends standing around me (I’d get a pretty accurate answer, at least in theory). For some reason, my program still wasn’t giving me 100% accuracy on the outfits for which I already knew were either good or bad; the error rates were skyrocketing very early on and I couldn’t figure out why.
After a couple days of pulling out my hair, I finally found it in the very, very last line:
return (A, D, J, error)
That final line should’ve read:
return (A, D, J, (1 - L))
So much trouble, for 5 measly characters.
More later – gotta get back to ApacheCon!