Don’t look now, but the 2009 rendition of Google Codejam is very close to concluding its qualification round!
The competition opened yesterday at 7pm eastern. I worked with my partner in C++ crime who goes by the moniker of Danimal (we’re high school buddies who have pursued somewhat different fields within computer science, but we still keep in regular touch, and even work on projects together, most notably a burgeoning open source real-time strategy video game…but that’s top secret), and we kicked the living arse out of the qualification round.
For those of you not familiar, Codejam is simply a coding competition sponsored by Google, with a grand prize of $5000. Dan and I missed the announcement last year, so we’ve basically been planning our entire year around today to make sure we didn’t miss it again. 😛
The competition consists of several rounds, the first of which is a qualification round. There are three programming questions – often more algorithmically intensive than programming intensive – and the goal is to solve as many of the problems as quickly as possible. For the purposes of the qualification round, we were only required to solve one question correctly in order to move on to Round 1 (scheduled for next week).
Here’s how it went down.
Problem A: Alien Language
After years of study, scientists at Google Labs have discovered an alien language transmitted from a faraway planet. The alien language is very unique in that every word consists of exactly L lowercase letters. Also, there are exactly D words in this language.
Once the dictionary of all the words in the alien language was built, the next breakthrough was to discover that the aliens have been transmitting messages to Earth for the past decade. Unfortunately, these signals are weakened due to the distance between our two planets and some of the words may be misinterpreted. In order to help them decipher these messages, the scientists have asked you to devise an algorithm that will determine the number of possible interpretations for a given pattern.
A pattern consists of exactly L tokens. Each token is either a single lowercase letter (the scientists are very sure that this is the letter) or a group of unique lowercase letters surrounded by parenthesis ( and ). For example: (ab)d(dc) means the first letter is either a or b, the second letter is definitely d and the last letter is either d or c. Therefore, the pattern (ab)d(dc) can stand for either one of these 4 possibilities: add, adc, bdd, bdc.
3 5 4
Case #1: 2
Case #2: 1
Case #3: 3
Case #4: 0
We began with this problem (at Dan’s insistence, even though Dan disagreed with me when I said it was the toughest question of the three, and the submission statistics ended up proving me RIGHT…….ass). Conceptually speaking, it’s fairly straightforward – the most challenging portion is representing those multiple “ambiguous letters” as a single character when attempting to determine which words in the dictionary it could possibly be. Parsing those ambiguous letters out and keeping track of the index they belong to was my biggest hurdle. I ended up removing everything between the braces and replacing it with the character ‘0’, thus maintaining the correct string length but also ensuring it wouldn’t be confused later as a correctly assigned character.
Dan and I both successfully solved this problem, both the small and large data sets.
Problem B: Watersheds
Geologists sometimes divide an area of land into different regions based on where rainfall flows down to. These regions are called drainage basins.
Given an elevation map (a 2-dimensional array of altitudes), label the map such that locations in the same drainage basin have the same label, subject to the following rules.
* From each cell, water flows down to at most one of its 4 neighboring cells.
* For each cell, if none of its 4 neighboring cells has a lower altitude than the current cell’s, then the water does not flow, and the current cell is called a sink.
* Otherwise, water flows from the current cell to the neighbor with the lowest altitude.
* In case of a tie, water will choose the first direction with the lowest altitude from this list: North, West, East, South.
Every cell that drains directly or indirectly to the same sink is part of the same drainage basin. Each basin is labeled by a unique lower-case letter, in such a way that, when the rows of the map are concatenated from top to bottom, the resulting string is lexicographically smallest. (In particular, the basin of the most North-Western cell is always labeled ‘a’.)
9 6 3
5 9 6
3 5 9
0 1 2 3 4 5 6 7 8 7
7 6 7
7 6 7
1 2 3 4 5
2 9 3 9 6
3 3 0 8 7
4 9 8 9 8
5 6 7 8 9
8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8
a b b
a a b
a a a
a a a a a a a a a b
a a a
b b b
a a a a a
a a b b a
a b b b a
a b b b a
a a a a a
a b c d e f g h i j k l m
n o p q r s t u v w x y z
Personally, this was my favorite question, in no small part because it reminded me of the machine learning watershed method of fluorescence microscopy image segregation 😛 . It’s much less difficult than it actually reads at first, and the submission statistics put this one’s difficulty on par with the last question, and easier than A. The hardest part of this problem is the very last step – determining what sink each cell in the basin ultimately drains into.
Originally, I was approaching it from a similar direction as a dynamic programming protein alignment problem I had written back in the spring – each cell keeps an array of references to all cells that point to it, so that backtracking is made easier at the expense of more initial overhead. However, Dan made the astute observation that, since each cell is already storing a single reference to the cell it flows into (unless the cell is a sink), why not also store a single reference in each cell to its ultimate sink?
This essentially became traversing a linked list, iterating over the “next” references until we reached the sink, and then assigned that to the current cell. From there, labeling each cell was a snap.
Dan and I both correctly solved this problem as well, both the small and large data sets.
Problem C: Welcome to Code Jam
So you’ve registered. We sent you a welcoming email, to welcome you to code jam. But it’s possible that you still don’t feel welcomed to code jam. That’s why we decided to name a problem “welcome to code jam.” After solving this problem, we hope that you’ll feel very welcome. Very welcome, that is, to code jam.
If you read the previous paragraph, you’re probably wondering why it’s there. But if you read it very carefully, you might notice that we have written the words “welcome to code jam” several times: 400263727 times in total. After all, it’s easy to look through the paragraph and find a ‘w’; then find an ‘e’ later in the paragraph; then find an ‘l’ after that, and so on. Your task is to write a program that can take any text and print out how many times that text contains the phrase “welcome to code jam”.
To be more precise, given a text string, you are to determine how many times the string “welcome to code jam” appears as a sub-sequence of that string. In other words, find a sequence s of increasing indices into the input string such that the concatenation of input[s], input[s], …, input[s] is the string “welcome to code jam”.
The result of your calculation might be huge, so for convenience we would only like you to find the last 4 digits.
elcomew elcome to code jam
wweellccoommee to code qps jam
welcome to codejam
Case #1: 0001
Case #2: 0256
Case #3: 0000
This one was interesting. At first glance, it is nothing more than the classic “longest common subsequence” problem. It even lent itself to a very elegant recursive solution of my own design, for a total code base of a mere 90 lines.
The idea is pretty simple. You have two strings: a “search” string (the strings on the left in the above sample), and the “find” string (“welcome to code jam”). Starting at the beginning of both, run through the search string until a matching character is found. Ensure all the other characters in the find string are present in the correct order in the find string. Provided that’s true, you have 1 valid subsequence. Repeat, matching that first character of the find string until you reach the end of the search string (exercise for the reader: there’s an opportunity to optimize here!). Then move on to the second character in the find string, and repeat the whole process again, until the very last character in the find string is matched.
My recursive method had a base case of when the find string is empty – this means we have reached its end and matched every single character in it, so we have a valid subsequence – and returned 1. The recursive case was a loop starting at the beginning of both strings, looking for a matching character in the search string to the find string’s first character. If a match was found, then recurse, passing in substrings of both: everything in the search string after the matched character, and everything in the find string minus the very first character.
The algorithm was all of 15 lines; the other 75 lines were file I/O to read in the test cases and print them out. It worked perfectly on the sample input, and the small input file.
…woe was me when I tested it on the large input. A line with 500 characters, most of which were of the form “wwwwwwwweeeeeeeeeeelllllllllllllccccccccccooooooommmmmeeeeeee …” (use combinatorics to figure out how obscenely massive the number of subsequences would be). As best I can tell, my elegant recursive solution blew its stack – it couldn’t even finish the first test case of about 100.
Dan, too, had the same problem. We both solved the small input, but timed out on the large one. Needless to say, the overhead of recursion in this case proved too much – an iterative solution is what is required.
With only minutes left in the qualification round, it’s unlikely anyone could use these insights to their advantage. 😛 By answering one short input correctly (we got all three) and one large input correctly (those aren’t checked until after the round ends, but we’re pretty confident on the two we got), one is passed on to Round 1, where another batch of problems will be posed. This time around, however, the total points for correct answers and the time taken to achieve those correct answers will factor into who moves on to Round 2.
Dan and I weren’t working under any time constraints; we thoroughly worked each problem, stopping to bounce ideas off each other, waiting for the other to catch up, taking breaks, and also goofing around quite a bit, so we clocked in around several hours total. The current first place contestant finished all three problems in twenty-five minutes.
Of course, time starts from the moment one submits a solution to one of the problems, so this person probably worked all three questions and submitted them at the same time…but still. That’s beastly.
Man I love this stuff 😀 Kudos to Google for organizing Codejam again!