The challenges of dynamic programming

According to its Wikipedia entry:

[D]ynamic programming is a method of solving problems that exhibit the properties of overlapping subproblems and optimal substructure[.]

My current assignment is to design a program that analyzes two proteins (specifically, their amino acid sequences) to uncover possible similarities, which includes finding as many matching subsequences as possible between them.

Naively, comparing every subsequence within one protein to every subsequence in the other is an extremely time-consuming task.  We’re talking an exponential number of iterations relative to the lengths of the proteins we’re analyzing: even adding a single amino acid greatly increases the number of possible sequence combinations.

So, let’s have dynamic programming take a crack at it.

The astute (and, most likely, algorithmically inclined) among you will notice that in the naive approach I’ve outlined above, many of the same calculations are performed over and over.  For those who aren’t familiar with algorithms, take this example: factoring a given number, say 42.  You start at the lowest number (1) and work your way up.

But there comes a point where you don’t need to keep asking yourself “Is 42 divisible by this number?”  1, 2, 3, 6… sure, you could continue with “Does 7 go into 42? And 8? And 9?” and so on, all the way up to 42.  But if you recognize that 6 * 7 = 42, and that past the square root of 42 every factor pairs with one you’ve already found, congratulations: you’re more algorithmically minded than you might think.  This is, in a way, how dynamic programming works: it reuses previously computed answers in future calculations, taking the most efficient computational pathway possible to the correct answer.
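
To make that concrete, here’s a quick Python sketch of the idea (the function name is just for illustration): once trial division passes the square root of n, every remaining factor has already shown up as the partner of a smaller one, so we can stop early.

```python
import math

def factor_pairs(n):
    """Find every factor pair of n by trial division up to sqrt(n)."""
    pairs = []
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            pairs.append((d, n // d))  # d's partner is n // d
    return pairs

print(factor_pairs(42))  # [(1, 42), (2, 21), (3, 14), (6, 7)]
```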

More specifically: in the protein example, if we have two proteins of lengths n and m amino acids respectively, using dynamic programming to generate matching sequences should take no more than O(nm) iterations, as opposed to the naive, combinatorially intense approach, whose running time is exponential in the sequence lengths.

For both groups of people, another example: building the Fibonacci sequence.  Everyone knows how that works, right? 🙂

1, 1, 2, 3, 5, 8, 13… Each number is the sum of the two numbers before it.  So, if we had a function that computed the xth term in the sequence, that function would look something like this:

Fibonacci(x) = Fibonacci(x - 1) + Fibonacci(x - 2)
Fibonacci(1) = 1
Fibonacci(0) = 0
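
In Python, that definition transcribes almost literally; here’s a minimal sketch of the naive, top-down version:

```python
def fibonacci(x):
    """Naive top-down recursion, straight from the definition above."""
    if x == 0:
        return 0
    if x == 1:
        return 1
    return fibonacci(x - 1) + fibonacci(x - 2)

print(fibonacci(5))  # 5
```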

Let’s use the example of Fibonacci(5) (which means we’re computing the 5th term in the sequence, which should give us 5).

  • Fibonacci(5)
  • Fibonacci(4) + Fibonacci(3)
  • (Fibonacci(3) + Fibonacci(2)) + (Fibonacci(2) + Fibonacci(1))
  • ((Fibonacci(2) + Fibonacci(1)) + (Fibonacci(1) + Fibonacci(0))) + ((Fibonacci(1) + Fibonacci(0)) + Fibonacci(1))
  • (((Fibonacci(1) + Fibonacci(0)) + Fibonacci(1)) + (Fibonacci(1) + Fibonacci(0))) + ((Fibonacci(1) + Fibonacci(0)) + Fibonacci(1))

Given the final expansion, replace each Fibonacci(1) with 1 and each Fibonacci(0) with 0, and you get 5, which is the answer we wanted.  However, you can very clearly see how inefficient this is: Fibonacci(2), for example, was calculated three separate times from scratch.  For larger values of x, the number of repetitions becomes enormous, leading to many superfluous calculations.
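
To make the waste concrete, here’s a quick sketch that counts how many times each term gets recomputed (the Counter bookkeeping is purely for illustration):

```python
from collections import Counter

calls = Counter()

def fibonacci(x):
    calls[x] += 1  # tally every time this term is recomputed
    if x <= 1:
        return x
    return fibonacci(x - 1) + fibonacci(x - 2)

fibonacci(5)
print(calls[2])  # 3 -- Fibonacci(2) really was computed three times
print(calls[1])  # 5 -- and Fibonacci(1) five times
```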

I know what you’re thinking.

You’re thinking: so calculate a particular value once and reuse it several times!

Well, that’s half the answer. 😛  The other half is to use a bottom-up approach (the naive approach was top-down), so that each value is calculated exactly once and is simply looked up whenever it’s needed again.  In this example, we’d start with Fibonacci(0) and Fibonacci(1), and use them to calculate Fibonacci(2).  Fibonacci(2) could then be used to calculate Fibonacci(3) and Fibonacci(4), and so on.  Suddenly, we’ve gone from exponential computation to linear time.
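
Here’s a minimal bottom-up sketch of the same function, keeping only the two most recent terms as we climb up from the base cases:

```python
def fibonacci(x):
    """Bottom-up: build from the base cases, computing each term once."""
    if x == 0:
        return 0
    prev, curr = 0, 1  # Fibonacci(0) and Fibonacci(1)
    for _ in range(2, x + 1):
        prev, curr = curr, prev + curr  # each new term computed exactly once
    return curr

print(fibonacci(5))  # 5, in linear time and constant space
```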

And so it goes with amino acid comparison: when aligning the sequences, we fill in a table of partial alignments, visiting each subproblem only once and branching out from the best score found at each location.  At heart it’s a pathfinding problem, though unlike, say, the traveling salesman problem, this one has the optimal substructure that lets dynamic programming solve it efficiently.
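
I won’t post the assignment itself here (it’s due tomorrow, after all), but as a minimal sketch of the O(nm) table idea, here’s the classic longest-common-subsequence calculation: a simplified cousin of full sequence alignment, with no gap penalties or substitution scores.  Each cell reuses the answers to its smaller subproblems, so every subproblem is solved exactly once.  The toy sequences are purely illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b, in O(n*m).

    table[i][j] holds the answer for the prefixes a[:i] and b[:j], so
    each subproblem is computed once and then reused by its neighbors.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1  # extend a match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]

print(lcs_length("GATTACA", "GCATGCU"))  # 4 (e.g. the subsequence "GATC")
```

Tracing back through the table from the bottom-right corner is what recovers the actual matched letters, and that traceback is exactly the part I’m still wrestling with.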

Unfortunately, as with most efficient algorithms, a certain amount of ease of implementation is traded away for that efficiency.  And that is where I am at the moment. 😛  This assignment is due tomorrow, and while I have a good portion of it completed, I have not yet managed to trace out the optimal path of matching amino acids.

Oy.

In other news, I had my final interview with Joshua Woods at IBM, who would be my supervisor in the Extreme Blue program if I were selected from the finalist pool to work for them this summer.  I think it went fairly well, but I’ll find out for sure in late March.  Keep your fingers crossed…I really really REALLY want this internship.

I have submitted quite a few other applications for internships, all of which look extremely interesting; really, I’d be happy at any of them.  I am just hoping for a few nibbles in the next few weeks, particularly given this economy.
