In theory, anyway.
If you’re at all interested, I’ve started posting code from my coursework on this GitHub repository. I’ll eventually add stuff from classes I took as a master’s student, and even as an undergrad at Georgia Tech. The specific code in question is under the 10-605 folder, assignment “hw4a”: Naive Bayes on Hadoop.
I also posted a fairly in-depth technical discussion on my research blog about the basics of the algorithm and how to write the various steps in the pipeline, if you’re interested. I’m making a slightly different point here, mainly:
Why the crap did the code execute just fine on my single machine (laptop), my virtual cluster (4 Ubuntu VMs), and the CMU Andrew cluster, and fail miserably on Amazon?
I did all the development on my rock-solid reliable laptop, and ran basic testing there just to make sure the code would run without crashing. However, with MapReduce applications, there really is no substitute for deploying on a cluster to make sure the code is actually functional. I built my beefy desktop Ronon with precisely this use case in mind: with nearly 3TB of disk space across four hard drives, 24GB of memory, and a quad-core i7, it’s built for virtualization.
I set up 4 virtual machines, each running Ubuntu 12.10, and configured them to work together as a physical Hadoop cluster would. Of course, any Hadoop job would run painfully slowly on this “virtual cluster” compared to a physical 4-machine setup, but the code won’t realize that, and frankly neither will the operating systems; it’s a perfect environment for making sure my MapReduce code really works.
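The heart of wiring VMs together like this is pointing every node at the same namenode. A minimal sketch of the relevant `core-site.xml` entry under Hadoop 1.x (the hostname and port here are placeholders of my own invention, not my actual setup):

```xml
<!-- core-site.xml, identical on every VM in the cluster -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://vm-master:9000</value>
  </property>
</configuration>
```

With that (plus the usual `hdfs-site.xml` and `mapred-site.xml` entries and passwordless SSH between nodes), the VMs behave like a real cluster as far as Hadoop is concerned.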
Even better, though, is that I have access to Carnegie Mellon’s computing cluster (more specifically, the computer science department’s), a legitimate Hadoop setup for deploying MapReduce applications. In sum, I had three layers of testing–my laptop, my virtual cluster, and a physical cluster–before ever reaching Amazon. I figured that by the time I deployed my code on Amazon, I’d know for sure it would work.
With distributed programming, think of a bunch of people working in parallel to solve one big problem by each independently solving smaller (and hopefully, easier) problems. In theory, this should be a lot faster than one person working on the one big problem, right?
I tried to be clever in my application: I made use of a Hadoop feature called the DistributedCache. In the analogy of a bunch of people working in parallel, the cache is like giving everyone the ability to communicate with each other by email, except they can’t attach files, so their communications have to be small. It’s a very convenient thing to have.
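For the curious, the basic pattern looks roughly like this under the Hadoop 1.x API (the class skeleton and file path are illustrative, not my actual assignment code):

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheSketch {
    // In the job driver: register a small file so every task gets a local copy.
    public static void registerSideFile(Configuration conf) throws URISyntaxException {
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/user/me/priors.txt"), conf);
    }

    // In a Mapper's setup(): the framework has already copied the file to the
    // task's local disk; look up the local path(s) and read like any local file.
    public static Path firstCachedFile(Configuration conf) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        return cached[0];
    }
}
```

The driver registers the file once; each mapper and reducer then reads a local copy instead of hammering HDFS, which is exactly what you want for small shared data like per-class priors.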
Unfortunately, while it’s not hard to use, the instruction manual is virtually nonexistent.
It took an entire day to track the problem down to the one line–one line–of code that was causing Amazon to hiccup. I suspect it stems from the fact that Amazon has an additional data storage layer (S3) that none of my testing environments had; when I tried to read information back, I wasn’t being specific enough about where that information lived. My virtual cluster and the CMU cluster are simple enough in their setup and configuration to figure out what I meant. Amazon, on the other hand, is such a behemoth that there was no alternative but to be perfectly explicit.
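I won’t quote the offending line here, but the general failure mode can be sketched with plain `java.net.URI`: an unqualified path silently resolves against whatever the default filesystem happens to be, which is fine right up until the data actually lives somewhere else. (The hostnames, bucket, and paths below are made up for illustration.)

```java
import java.net.URI;

public class PathResolution {
    // Given the cluster's default filesystem URI and a possibly-unqualified
    // path, return the fully qualified URI -- roughly the qualification
    // Hadoop performs internally when you hand it a bare path.
    public static String qualify(String defaultFs, String path) {
        URI u = URI.create(path);
        if (u.getScheme() != null) {
            return u.toString();  // already explicit, e.g. s3n://...
        }
        // No scheme: the path binds to the default filesystem.
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        // On my laptop, VMs, and the CMU cluster, the default FS held the
        // data, so an unqualified path "just worked":
        System.out.println(qualify("hdfs://namenode:9000/", "user/model/params.txt"));
        // prints hdfs://namenode:9000/user/model/params.txt

        // On Amazon the data lived in S3, so the unqualified form quietly
        // pointed at the wrong store; the fix is to be explicit:
        System.out.println(qualify("hdfs://namenode:9000/", "s3n://my-bucket/user/model/params.txt"));
        // prints s3n://my-bucket/user/model/params.txt
    }
}
```

One fully qualified path string, and Amazon was happy.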
On one hand, I’m thrilled that I got it working. On the other…really, Amazon? I tested thoroughly, under different versions of Hadoop (1.0.1 through 1.0.4) and different environments (standalone on OS X; virtualized on Ubuntu; a full RedHat cluster), and it still failed when I reached a fourth “production” environment on Amazon.
Thankfully, Amazon has a pretty awesome setup, with lots of logging and an intuitive interface, so it’s pretty straightforward to find the error messages and figure out why it crashed in the first place. Even if it takes a while to find the actual problem, at least it’s relatively easy to get your hands on all the information you need.
Aside from the obvious moral of “beware of differences between production and testing environments”, there’s a side story here: keep your bloody documentation current! And I’d be happy to help with that. Just sayin’.
As happy as these guys look to be done with that hill behind them: