Code! Jam!

Posted on September 3, 2009 by Shannon Quinn

Don’t look now, but the 2009 rendition of Google Codejam is very close to concluding its qualification round!

The competition opened yesterday at 7pm eastern. I worked with my partner in C++ crime who goes by the moniker of Danimal (we’re high school buddies who have pursued somewhat different fields within computer science, but we still keep in regular touch, and even work on projects together, most notably a burgeoning open source real-time strategy video game…but that’s top secret), and we kicked the living arse out of the qualification round.

For those of you not familiar, Codejam is simply a coding competition sponsored by Google, with a grand prize of $5000. Dan and I missed the announcement last year, so we’ve basically been planning our entire year around today to make sure we didn’t miss it again. 😛

The competition consists of several rounds, the first of which is a qualification round. There are three programming questions – often more algorithmically intensive than programming intensive – and the goal is to solve as many of the problems as quickly as possible. For the purposes of the qualification round, we were only required to solve one question correctly in order to move on to Round 1 (scheduled for next week).

Here’s how it went down.

Problem A: Alien Language

After years of study, scientists at Google Labs have discovered an alien language transmitted from a faraway planet. The alien language is very unique in that every word consists of exactly L lowercase letters. Also, there are exactly D words in this language.

Once the dictionary of all the words in the alien language was built, the next breakthrough was to discover that the aliens have been transmitting messages to Earth for the past decade. Unfortunately, these signals are weakened due to the distance between our two planets and some of the words may be misinterpreted. In order to help them decipher these messages, the scientists have asked you to devise an algorithm that will determine the number of possible interpretations for a given pattern.

A pattern consists of exactly L tokens. Each token is either a single lowercase letter (the scientists are very sure that this is the letter) or a group of unique lowercase letters surrounded by parenthesis ( and ). For example: (ab)d(dc) means the first letter is either a or b, the second letter is definitely d and the last letter is either d or c. Therefore, the pattern (ab)d(dc) can stand for either one of these 4 possibilities: add, adc, bdd, bdc.

Input Output

3 5 4 abc bca dac dbc cba (ab)(bc)(ca) abc (abc)(abc)(abc) (zyx)bc Case #1: 2 Case #2: 1 Case #3: 3 Case #4: 0

We began with this problem (at Dan’s insistence, even though Dan disagreed with me when I said it was the toughest question of the three, and the submission statistics ended up proving me RIGHT…….ass). Conceptually speaking, it’s fairly straightforward – the most challenging portion is representing those multiple “ambiguous letters” as a single character when attempting to determine which words in the dictionary it could possibly be. Parsing those ambiguous letters out and keeping track of the index they belong to was my biggest hurdle. I ended up removing everything between the braces and replacing it with the character ‘0’, thus maintaining the correct string length but also ensuring it wouldn’t be confused later as a correctly assigned character.

Dan and I both successfully solved this problem, both the small and large data sets.

Problem B: Watersheds

Geologists sometimes divide an area of land into different regions based on where rainfall flows down to. These regions are called drainage basins.

Given an elevation map (a 2-dimensional array of altitudes), label the map such that locations in the same drainage basin have the same label, subject to the following rules.

* From each cell, water flows down to at most one of its 4 neighboring cells.
* For each cell, if none of its 4 neighboring cells has a lower altitude than the current cell’s, then the water does not flow, and the current cell is called a sink.
* Otherwise, water flows from the current cell to the neighbor with the lowest altitude.
* In case of a tie, water will choose the first direction with the lowest altitude from this list: North, West, East, South.

Every cell that drains directly or indirectly to the same sink is part of the same drainage basin. Each basin is labeled by a unique lower-case letter, in such a way that, when the rows of the map are concatenated from top to bottom, the resulting string is lexicographically smallest. (In particular, the basin of the most North-Western cell is always labeled ‘a’.)

Input Output

5 3 3 9 6 3 5 9 6 3 5 9 1 10 0 1 2 3 4 5 6 7 8 7 2 3 7 6 7 7 6 7 5 5 1 2 3 4 5 2 9 3 9 6 3 3 0 8 7 4 9 8 9 8 5 6 7 8 9 2 13 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Case #1: a b b a a b a a a Case #2: a a a a a a a a a b Case #3: a a a b b b Case #4: a a a a a a a b b a a b b b a a b b b a a a a a a Case #5: a b c d e f g h i j k l m n o p q r s t u v w x y z

Personally, this was my favorite question, in no small part because it reminded me of the machine learning watershed method of fluorescence microscopy image segregation 😛 . It’s much less difficult than it actually reads at first, and the submission statistics put this one’s difficulty on par with the last question, and easier than A. The hardest part of this problem is the very last step – determining what sink each cell in the basin ultimately drains into.

Originally, I was approaching it from a similar direction as a dynamic programming protein alignment problem I had written back in the spring – each cell keeps an array of references to all cells that point to it, so that backtracking is made easier at the expense of more initial overhead. However, Dan made the astute observation that, since each cell is already storing a single reference to the cell it flows into (unless the cell is a sink), why not also store a single reference in each cell to its ultimate sink?

This essentially became traversing a linked list, iterating over the “next” references until we reached the sink, and then assigned that to the current cell. From there, labeling each cell was a snap.

Dan and I both correctly solved this problem as well, both the small and large data sets.

Problem C: Welcome to Code Jam

So you’ve registered. We sent you a welcoming email, to welcome you to code jam. But it’s possible that you still don’t feel welcomed to code jam. That’s why we decided to name a problem “welcome to code jam.” After solving this problem, we hope that you’ll feel very welcome. Very welcome, that is, to code jam.

If you read the previous paragraph, you’re probably wondering why it’s there. But if you read it very carefully, you might notice that we have written the words “welcome to code jam” several times: 400263727 times in total. After all, it’s easy to look through the paragraph and find a ‘w’; then find an ‘e’ later in the paragraph; then find an ‘l’ after that, and so on. Your task is to write a program that can take any text and print out how many times that text contains the phrase “welcome to code jam”.

To be more precise, given a text string, you are to determine how many times the string “welcome to code jam” appears as a sub-sequence of that string. In other words, find a sequence s of increasing indices into the input string such that the concatenation of input[s[0]], input[s[1]], …, input[s[18]] is the string “welcome to code jam”.

The result of your calculation might be huge, so for convenience we would only like you to find the last 4 digits.

Input Output

3 elcomew elcome to code jam wweellccoommee to code qps jam welcome to codejam Case #1: 0001 Case #2: 0256 Case #3: 0000

This one was interesting. At first glance, it is nothing more than the classic “longest common subsequence” problem. It even lent itself to a very elegant recursive solution of my own design, for a total code base of a mere 90 lines.

The idea is pretty simple. You have two strings: a “search” string (the strings on the left in the above sample), and the “find” string (“welcome to code jam”). Starting at the beginning of both, run through the search string until a matching character is found. Ensure all the other characters in the find string are present in the correct order in the find string. Provided that’s true, you have 1 valid subsequence. Repeat, matching that first character of the find string until you reach the end of the search string (exercise for the reader: there’s an opportunity to optimize here!). Then move on to the second character in the find string, and repeat the whole process again, until the very last character in the find string is matched.

My recursive method had a base case of when the find string is empty – this means we have reached its end and matched every single character in it, so we have a valid subsequence – and returned 1. The recursive case was a loop starting at the beginning of both strings, looking for a matching character in the search string to the find string’s first character. If a match was found, then recurse, passing in substrings of both: everything in the search string after the matched character, and everything in the find string minus the very first character.

The algorithm was all of 15 lines; the other 75 lines were file I/O to read in the test cases and print them out. It worked perfectly on the sample input, and the small input file.

…woe was me when I tested it on the large input. A line with 500 characters, most of which were of the form “wwwwwwwweeeeeeeeeeelllllllllllllccccccccccooooooommmmmeeeeeee …” (use combinatorics to figure out how obscenely massive the number of subsequences would be). As best I can tell, my elegant recursive solution blew its stack – it couldn’t even finish the first test case of about 100.

Dan, too, had the same problem. We both solved the small input, but timed out on the large one. Needless to say, the overhead of recursion in this case proved too much – an iterative solution is what is required.

Conclusion

With only minutes left in the qualification round, it’s unlikely anyone could use these insights to their advantage. 😛 By answering one short input correctly (we got all three) and one large input correctly (those aren’t checked until after the round ends, but we’re pretty confident on the two we got), one is passed on to Round 1, where another batch of problems will be posed. This time around, however, the total points for correct answers and the time taken to achieve those correct answers will factor into who moves on to Round 2.

Dan and I weren’t working under any time constraints; we thoroughly worked each problem, stopping to bounce ideas off each other, waiting for the other to catch up, taking breaks, and also goofing around quite a bit, so we clocked in around several hours total. The current first place contestant finished all three problems in twenty-five minutes.

Of course, time starts from the moment one submits a solution to one of the problems, so this person probably worked all three questions and submitted them at the same time…but still. That’s beastly.

Man I love this stuff 😀 Kudos to Google for organizing Codejam again!

funny-pictures-cats-promise-to-not-touch-your-food

About Shannon Quinn

Oh hai!

View all posts by Shannon Quinn →

This entry was posted in Internet, lolcat, Programming, Technology and tagged algorithms, c++, codejam, danimal danimal sexy animal, google, holy crap im an uber geek, im that guy who makes jokes in the comments of source code, java, longest common subsequence, machine learning, massive powers of two = death to recursion, seeplusplus, traveling salesperson, watershed segmentation. Bookmark the permalink.

19 Responses to Code! Jam!

alex Pi says:

September 3, 2009 at 8:23 pm

Hi there, I also tried the iterative solution, using a stack u can simulate recursiveness… but it was not enough.

Any other idea ?

Grats!

Reply
- magsol says:
  
  September 3, 2009 at 8:31 pm
  
  Hi Alex,
  
  I thought about using a stack, and I may try that if I can’t flesh out the idea I’m working on now. It basically works as I mentioned above:
  
  Start a loop on the find string (“welcome to code jam”). For each character c, find all occurrences of c in the search string (long-ass random string). Whenever you find c, create a substring of the search string, starting at the index of c and continuing through the end. Make sure that substring contains at least one sequence of the characters in the find string after c.
  
  If it does contain at least one subsequence, then increment your counter by 1, and try to find another unique occurrence of c in the search string. Repeat this process until you’ve examined all characters c in the search string.
  
  Once you’ve finished searching the search string for c, increment your pointer variable in the find string to the next character c, and repeat the whole process again!
  
  I haven’t had a chance to really flesh this out and see if it works yet, but I think the theory is sound.
  
  Reply
- magsol says:
  
  September 3, 2009 at 8:39 pm
  Pseudocode might look like this:
```
string find = "welcome to code jam"
string search = {blah blah blah whatever}
int array = new int[find.length]

for (each character c in find) {

  // this counts how many times the character c
  // is part of a valid subsequence
  int totalSubSequences = 0

  for (each character c2 in search) {
    if (c == c2) { 

      // this means we've found a matching character,
      // so ensure there is at least one subsequence 
      // of the remaining characters after it

      string subfind = find.substring(indexOf(c) + 1)
      string subsearch = search.substring(indexOf(c2) + 1)

      if (subsearch.hasSubSequence(subfind)) {
        totalSubSequences++
      }
    }
  }
  array[indexOf(c)] = totalSubSequences
}

output product(array)
```
  I think that should work. All the function hasSubSequence does is make sure all the characters in subfind occur at least once in their correct order in the subsearch string (definition of a sequence: a particular ordering of characters that aren’t necessarily consecutive).
  Reply
  - Kevin says:
    
    September 3, 2009 at 10:17 pm
    
    actually, if you see how you count the totalSubSequences, you’ll notice that you only add 1 every time.
    that means your algorithm takes at least as long as subsequences found.. and that is a lot.
    you cant find the subsequences one-by-one, there has to be some product somewhere..
  - magsol says:
    
    September 3, 2009 at 10:22 pm
    
    @Kevin: Yeah, you’re absolutely right. I forgot to include the fact that you need to track how many times each individual character in the find string is part of a valid subsequence. I figure an array of integers, with the size of the array = length of find; each index in the integer array corresponds to the count of the character in the same index in the find string.
    
    Then, once the loop finishes, simply multiply all the numbers in the array together for the final count. Combinatorics 101.
    
    Good catch, thanks.
  - Kevin says:
    
    September 4, 2009 at 12:05 am
    
    @magsol: mm.. im not sure i understud your new algorithm, but i think it doesnt work.. you have to consider the relative positioning between the letters too..
    my recursive algoritm was like this:
    
    def ocurr(phrase,text):
    if len(phrase)==1: return text.count(phrase[0])
    res=0
    for i in range(1,len(text)-len(phrase)+2):
    if phrase[0]==text[i-1]:
    res+=ocurr(phrase[1:],text[i:])
    return res
    
    (it only works for the small set)
    
    later i found a -very- nice algoritm:
    
    welcome=”welcome to code jam”
    line=f.readline()
    count=[1]+[0]*len(welcome)
    for c in line:
    for w in reversed(range(1,len(welcome)+1)):
    if welcome[w-1]==c:
    count[w]+=count[w-1]
    print str(count[-1])[-4:].zfill(4)
    
    🙂
  - Kevin says:
    
    September 4, 2009 at 12:08 am
    
    oops, without spacing it doesnt make any sense.
    here again the algorithms:
    
    RECURSIVE:
    def ocurr(phrase,text):
    ··if len(phrase)==1: return text.count(phrase[0])
    ··res=0
    ··for i in range(1,len(text)-len(phrase)+2):
    ····if phrase[0]==text[i-1]:
    ······res+=ocurr(phrase[1:],text[i:])
    ··return res
    
    (it only works for the small set)
    
    NICE ONE:
    welcome=”welcome to code jam”
    line=f.readline()
    count=[1]+[0]*len(welcome)
    for c in line:
    ··for w in reversed(range(1,len(welcome)+1)):
    ····if welcome[w-1]==c:
    ······count[w]+=count[w-1]
    print str(count[-1])[-4:].zfill(4)
    
    sorry
- magsol says:
  
  September 3, 2009 at 9:16 pm
  
  Actually, my method is flawed. You need to also account for the fact that, as the main loop (which iterates through the find string) increments, you need to start the inner loop (loops through the search string) at incrementally larger indices – specifically, after the first occurrence in search of the character that the previous value of the counter of the outer loop pointed to in find that was found in a valid subsequence.
  
  Man. That bold part is uber confusing. Sorry =/
  
  Reply
Rob says:

September 3, 2009 at 9:07 pm

The first question might as well have been “do you know what a fuckin FSA is?” And your answer was “no but i can kludge some shit together anyway”

Reply
- magsol says:
  
  September 3, 2009 at 9:10 pm
  
  I’m not going to use freaking FSA for doing something a well-formed regex and a couple calls to “string.contains()” could do.
  
  Reply
Rob says:

September 3, 2009 at 9:48 pm

this may come as a shock to you but a RE is just a way of specifying the language accepted by a FSA

Reply
- magsol says:
  
  September 3, 2009 at 9:51 pm
  
  No shit sherlock. Hence my previous comment: I wasn’t going to implement a standalone FSA when regex would do just fine.
  
  Reply
nananash says:

September 7, 2009 at 12:58 pm

Do notice that the contest is supposed to be resolved individually without cooperation, I doubt you two would have been unable to advance to round 1 alone, so I am not going to report you dudes or anything… Just make sure to work individually in round 1 🙂

Reply
- magsol says:
  
  September 7, 2009 at 3:59 pm
  
  In re-reading the rules, you are absolutely correct, and I do apologize…I know the excuse “I had no idea” isn’t valid, but we were both under the impression that as long as we didn’t share code – and given that he used C++ while I used Java, that was indeed the case – collaboration was ok. Grad school has been rubbing off on me to the point of making dangerous assumptions, it would seem.
  
  Thanks for the heads-up. He and I didn’t sign up for the same first-choice Round 1 time slots, but if on the off-chance we are given the same time bloc, we’ll be extra-sure not to collaborate at all. It’ll be a good opportunity to prove once and for all who is the better coder 😛
  
  Reply
  - Kevin says:
    
    September 7, 2009 at 7:00 pm
    
    hey!
    look at the date-time.
    that was after the qualification round had finished..
    dont worry, i dont talk to anybody during the comp 😉
    (just like to discuss problems afterwards, thanks!)
shastabeast says:

September 10, 2009 at 5:50 pm

Nubnub

Reply
- magsol says:
  
  September 10, 2009 at 10:55 pm
  
  Poor nub.
  
  Reply
Amol Bhave says:

September 12, 2009 at 10:59 am

Just replace the ( and ) with [ and ]
in problem A

And take the inputs and match

Reply
- magsol says:
  
  September 12, 2009 at 9:23 pm
  
  That is a phenomenal solution! Thank you.
  
  Reply

Code! Jam!

Problem A: Alien Language

Problem B: Watersheds

Problem C: Welcome to Code Jam

Conclusion

About Shannon Quinn

19 Responses to Code! Jam!

Leave a reply to magsol Cancel reply

Interesting Folks

My Other Blogs

Twit twit twit

Calendar of mindless drivel

Email Subscription

Input	Output
`3 5 4 abc bca dac dbc cba (ab)(bc)(ca) abc (abc)(abc)(abc) (zyx)bc`	`Case #1: 2 Case #2: 1 Case #3: 3 Case #4: 0`

Input	Output
`5 3 3 9 6 3 5 9 6 3 5 9 1 10 0 1 2 3 4 5 6 7 8 7 2 3 7 6 7 7 6 7 5 5 1 2 3 4 5 2 9 3 9 6 3 3 0 8 7 4 9 8 9 8 5 6 7 8 9 2 13 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8`	`Case #1: a b b a a b a a a Case #2: a a a a a a a a a b Case #3: a a a b b b Case #4: a a a a a a a b b a a b b b a a b b b a a a a a a Case #5: a b c d e f g h i j k l m n o p q r s t u v w x y z`

Input	Output
`3 elcomew elcome to code jam wweellccoommee to code qps jam welcome to codejam`	`Case #1: 0001 Case #2: 0256 Case #3: 0000`

Code! Jam!

Problem A: Alien Language

Problem B: Watersheds

Problem C: Welcome to Code Jam

Conclusion

Share this:

Related

About Shannon Quinn

19 Responses to Code! Jam!

Leave a reply to magsol Cancel reply

Interesting Folks

My Other Blogs

Twit twit twit

Calendar of mindless drivel

Email Subscription