My desktop, Ronon, is a beast (yes, its namesake is exactly what you think it is): twin Radeon 4890 video cards, terabytes of hard drive space (will eventually be adding an SSD to the mix; that’s the next major upgrade), 18GB of memory, and a quad-core i7 CPU. I designed it specifically to have horsepower to spare for whatever graduate school could throw my way.
And now that my research is turning toward distributed computing, it’s coming in handy: a VirtualBox setup running four instances of Ubuntu, one as a master node and three as slaves, constituting a proof-of-concept Hadoop cluster.
There’s the primary Hadoop website with lots of great documentation, plus the mailing lists to the Apache developers themselves. But I found some amazing tutorials that go step-by-step through a fairly typical / basic setup, just to get a cluster up and running.
A few months ago, I followed the multi-node tutorial to get Hadoop running on the aforementioned setup: each of the four [virtual] nodes would have 2GB of memory plus 20GB of hard disk space, and 1 CPU. Obviously the performance gains here will be marginal; after all, it’s still on one physical machine. But the point here was to get something that could pass as a Hadoop cluster up and running so I could do some preliminary MapReduce work. It only took a few hours, and I was up and running.
Until October. Why October? It’s when the new version of Ubuntu–11.10–was released.
All of a sudden, the Datanodes (Hadoop jargon for the slaves, or “workhorses”) couldn’t communicate with the Namenode (jargon for the master node). I’d get entries upon entries in the logs of “Retrying connect to server…” with no success.
Not being particularly well-versed in Hadoop, I set about trying to find out what could be causing the error I was seeing. So I Google’d the error message.
Didn’t find a whole lot, other than the Hadoop wiki giving me some basic troubleshooting tips. Unfortunately these did not solve the connectivity problem: I played around with all the settings (if you read the tutorials, there really aren’t many), to no avail.
My next stop: StackOverflow.
I love this website. Seriously. The people who frequent it are so bloody smart, and with nary an exception, I’ve always either found the answer on my own from some suggestions posted, or someone knew the answer outright. It’s a fantastic site.
Anyway. I posted this question. The initial responses echoed what I had already been thinking, but couldn’t quite justify given the evidence: that a firewall was blocking the slaves from seeing the master. There were a few problems with this hypothesis:
- I had no firewall running on any of my Ubuntu instances.
- I could SSH from one VM to another (both ways), and the slaves could view the master’s website.
After installing a firewall, explicitly disabling it, and seeing no difference, I finally had the idea to try out nmap. I installed it on one of the slaves, fired up Hadoop, and scanned the master.
nmap -v -sU -p 54310,54311 192.168.1.10
The ports were definitively listed as “closed”. Well, that would explain the symptoms, but it didn’t help in trying to figure out what was wrong.
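For anyone who wants to repeat this check without nmap, here is a minimal sketch of the same idea in Python: simply try to open a TCP connection to the port and see whether it succeeds. The function name `port_reachable` and the 2-second timeout are my own choices for illustration, and it assumes the ports in question speak TCP. It also can't tell "closed" apart from "filtered" the way nmap can.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts alike.
        return False
```

Run from one of the slaves, `port_reachable("192.168.1.10", 54310)` would be the rough equivalent of the scan above for a single port.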
Another point of interest was “netstat” on the master. I would run it every time I fired up Hadoop just to make sure that the right ports were listening. Sure enough, 54310 and 54311 (the ports that the slaves were complaining weren’t open) were bound by the master and set to listen. It showed other ports as well–50030, 50090, and the other web status ports–as open, and these were accessible from browsers in the slaves.
Why weren’t the slaves–or nmap–seeing these apparently open ports??
I looked closer at the netstat output.
…wait a minute…
I went into Ubuntu’s “/etc/hosts” file, deleted the offending entry, and everything started working just fine. The crux: when I updated from 11.04 to 11.10, Ubuntu automatically added a new entry to the hosts file that redirected the hostname (which I was using in the Hadoop configuration) to the local IP address, i.e. an address that other machines couldn’t reach. That is why the ports looked open from the master, but closed from the slaves.
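Had I known what to look for, a few lines of Python could have flagged the problem right away. The sketch below scans the text of a hosts file for lines that map a given hostname to a loopback address. The hostname `master` and the `127.0.1.1` line in the sample are assumptions for illustration only, not a copy of my actual file, and `loopback_entries` is a hypothetical helper name.

```python
import ipaddress
from typing import List


def loopback_entries(hosts_text: str, hostname: str) -> List[str]:
    """Return the loopback addresses that hosts-file text maps `hostname` to."""
    hits = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if hostname not in names:
            continue
        try:
            if ipaddress.ip_address(address).is_loopback:
                hits.append(address)
        except ValueError:
            continue  # first field is not a valid IP address; skip the line
    return hits


# Illustrative sample only: hostname and loopback entry are assumed.
SAMPLE = """\
127.0.0.1    localhost
127.0.1.1    master
192.168.1.10 master
"""

print(loopback_entries(SAMPLE, "master"))  # ['127.0.1.1']
```

To check a real machine, pass in the contents of “/etc/hosts” and the hostname used in the Hadoop configuration. A non-empty result means a service that binds to that hostname may end up listening only on an address the other nodes can’t reach.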
So there you have it. Months of pulling my hair out (though only because it was a somewhat low priority compared to…everything else), and the solution was to delete a single line from a single file. Literally: the addition of one character solved this problem.