Millions of miles away, the Curiosity rover is going through an exhaustive battery of tests to make sure everything’s kosher before embarking on what will almost surely be the most thorough and informative Mars exploration to date.
In case I’ve been too subtle: yes, it’s been a ball-busting couple of weeks in the ‘burgh. Work has been extremely nose-to-the-grindstone, though I’m optimistic about how the impending fall semester will pan out. Nevertheless, here is a glimpse into the work that’s been keeping me up late:
Ok but really: I’ve been working on implementing a shiny new semi-supervised clustering algorithm, the basic idea being that we have labels for a small fraction of the data, while the vast majority is unlabeled. In the case of the images I’m using to test it: can I separate the foreground from the background? The catch here is that I’m giving the algorithm the entire image, but only a handful of pixels are labeled as actual “foreground” or “background” pixels; it has to figure out the rest on its own.
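To give a flavor of what “figure out the rest on its own” means, here’s a toy sketch. This is *not* the algorithm I’m implementing, just generic label propagation on a tiny 1-D “image”: two pixels get labels, and the labels spread through a graph of pixel similarities. All the names and parameter choices (the Gaussian width, the iteration count) are my own illustration.

```python
import numpy as np

# Toy stand-in for the setup above: a 1-D "image" whose intensities fall
# into two clusters (dark background, bright foreground). Only two pixels
# carry labels; -1 means unknown.
pixels = np.array([0.05, 0.10, 0.12, 0.85, 0.90, 0.95])
labels = np.array([0, -1, -1, -1, -1, 1])  # 0 = background, 1 = foreground

# Gaussian affinity between every pair of pixels (the graph matrix).
sigma = 0.2
W = np.exp(-(pixels[:, None] - pixels[None, :]) ** 2 / (2 * sigma**2))
np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)  # row-normalize into a transition matrix

# Spread the known labels through the graph, clamping the labeled
# pixels back to their given values after each step.
F = np.zeros((len(pixels), 2))
F[labels >= 0, labels[labels >= 0]] = 1.0
clamp = F.copy()
for _ in range(100):
    F = P @ F
    F[labels >= 0] = clamp[labels >= 0]

predicted = F.argmax(axis=1)
print(predicted)  # -> [0 0 0 1 1 1]: the unlabeled pixels pick sides
```

Two labels in, six labels out: the graph structure does the rest, which is exactly the appeal when hand-labeling every pixel is out of the question.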
What’s super cool about machine learning with images is that you get stuff like this:
Those familiar with this sort of thing will likely recognize the above images as eigenvectors of the original image. For any sort of spectral analysis where graphs are involved, finding eigenvectors is pretty much par for the course. It’s how you use the eigenvectors that makes each algorithm distinct. I’ve been working on implementing this one, ideally smoothing its implementation to the point where I can submit it to these guys.
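For the curious, here’s roughly where those ghostly images come from. This is a generic normalized-Laplacian sketch on a synthetic 4×4 image, not the specific construction from the algorithm I’m implementing; the affinity function and σ are placeholders of my own choosing.

```python
import numpy as np

# A tiny synthetic "image": dark everywhere except a bright 2x2 patch.
img = np.zeros((4, 4))
img[1:3, 1:3] = 1.0
x = img.ravel()

# Gaussian affinity between every pair of pixel intensities
# (a dense #pixels-by-#pixels graph matrix, as in the post).
sigma = 0.5
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))

# Normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(x)) - D_inv_sqrt @ W @ D_inv_sqrt

# np.linalg.eigh returns eigenvalues in ascending order; the eigenvectors
# for the smallest eigenvalues carry the cluster structure. Reshaped back
# to image dimensions, they look like ghostly versions of the original.
vals, vecs = np.linalg.eigh(L)
second = vecs[:, 1].reshape(img.shape)  # the "Fiedler vector"
print(np.round(second, 2))
```

Reshaping that second eigenvector to 4×4, the bright patch and the dark border take opposite signs, which is the cut you’d use to separate foreground from background.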
And then put it into Mahout. But that’s still a ways away (and I have other work to do there that’s more pressing than adding new algorithms).
Ultimately, I get this:
It’s not ideal, particularly since the few labeled pixels I handed off to the algorithm included the bowl as “background”, and yet the algorithm decided it knew better than I did and included it as foreground. So there are still some things to iron out.
But this is kinda neat!
We’re hoping to get a paper out in the next couple weeks on this. Sadly, it won’t involve cats; rather, our goal is to deploy it on small molecules. While I’m not structural biology’s biggest fan (read: I leave the seminar when I learn it’s on structural biology), this particular application is really interesting to me.
And yes, one of my to-dos is to optimize the algorithm a bit more, so it can handle images that are bigger than little thumbnails. As in most spectral algorithms, one needs to create a graph matrix that is #pixels-by-#pixels. For 50×50 images (like the little cat above), this is only 2500 total pixels, resulting in a graph matrix that’s 2500×2500. That’s only 6.25 million numbers, which fit comfortably in well under 1GB of memory. But for even slightly larger images–say, 640×480–now the number of pixels is 307,200, resulting in a graph matrix that holds more than 94 *billion* numbers (around 750GB at 8 bytes apiece). Most home computers right now only have 4 or 8GB of memory, and would instantly freeze when attempting to create the graph matrix.
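The arithmetic above is easy to sanity-check, assuming a dense matrix of 8-byte floats:

```python
def graph_matrix_gb(width, height, bytes_per_entry=8):
    """Size of a dense #pixels-by-#pixels graph matrix, in GB."""
    n = width * height                    # total number of pixels
    return n * n * bytes_per_entry / 1e9  # n-by-n entries, 8 bytes each

print(graph_matrix_gb(50, 50))    # 0.05 GB: fits easily
print(graph_matrix_gb(640, 480))  # ~755 GB: hopeless on a desktop
```

This is why sparse affinity graphs (connecting each pixel only to its neighbors) or out-of-core tricks become mandatory the moment you leave thumbnail territory.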
Other exciting news? September is almost here, which meaaaaaans:
I am trying to get back on a regular posting regimen. I’m thinking once a week on the same day each week. How do Tuesdays sound?
When the stress levels get too high, I just look at this picture. Enjoy!