Here it is: this is what I’m studying currently for my third and final midterm in Computational Biology.
This is an exploration in Hidden Markov Models. For those of you who really aren’t interested in the Wikipedia mumbo jumbo, an HMM is essentially a way of representing a possible set of hidden states using whatever we can observe in conjunction with a lot of statistical inference and probability statements.
Picture this: you’re at a casino, and you suspect the house may be playing with a loaded die. As we all know, a fair six-sided die has probability 1/6 for any of the sides landing face up. This loaded die, however, has probability 1/2 of landing on 6, and probability 1/10 of landing on any of the other five faces. Unfortunately, the state of the die being used is hidden from you; all you can observe is the sequence of rolls. Can we infer the sequence of die usages based on the rolls that we observe?
HMMs allow us to take a pretty good stab at this problem. Given:
-an observed sequence of outputs (die rolls)
-possible states (loaded die or fair die)
-state transition probability (the chances that the house will switch from fair to loaded, and vice versa)
-output probabilities given the state (chances of rolling a particular number with a fair die, versus chances of rolling that same number with the loaded one)
We can construct a model which will allow us to reasonably estimate the possible sequence of states that we cannot otherwise directly observe.
Here’s the crux: the math behind it all is terribly complex. Let’s start by defining a few functions.
alk: The probability of transitioning from state l to state k (e.g. switching from loaded to fair die).
ek(i): The probability of emitting character i, given that we’re in state k (e.g. probability of rolling a 6 given that the current die is fair)
Still with me? Good!
The next best thing is utilize a little something called the Viterbi algorithm: this computes, given all of the above inputs, what the most probable sequence of hidden states is. Ready for this?
Breaking it down, this computes the most probable sequence of states given that the sequence ends in state k and character i. It does this by computing the emission probability of this character in this state, and multiplying that by the maximum probability of all the other possible previous states and emissions from which this state could have transitioned. This number is then multiplied by the transition probability of the previous state l to the end state k. The recursive call to this algorithm is where everything gets murky, but we’ll do this the inductive way and say that it works. Magically. 😛
Clear as mud? Fabulous! Because we have yet another step to go. Using this Viterbi algorithm as a starting point, we can generalize this to generating a probability for arriving in some state with some emission from any possible path, not just the most likely one. Simply put, this next algorithm determines how likely it would be to, say, roll a 6 with a loaded die on the 10th roll. This is called the forward algorithm, and it looks like this:
Digesting this, it’s actually syntactically almost identical to the Viterbi algorithm above, except the “max” from Viterbi is replaced with the summation, since now we want the sum of all the possible probabilities, not just the single maximum probability. Other than that, it’s identical, and equally confusing once you hit the recursive (i – 1) call. 😛
Finally, the concept of the forward algorithm leads us to one more intuitive leap. We have a method for generating the single most probable path of states and outputs given a final state and output (Viterbi), a method for computing the overall likelihood of arriving at a particular state with a particular output (forward)…how about a method for computing the most probable sequence of states and outputs given a starting state and output?
Yes, dear reader, this method will be creatively named the backward algorithm. 😛 It is as follows:
This is similar to the forward algorithm syntactically, but different enough to induce a pause. Again, given a starting state k and initial output i, this computes the probability of transitioning from the starting state k to any of the other states l, multiplies this by the probability of emitting the next character xi + 1 from within state l, and then yet once more calls itself recursively over all the following positions in the sequence.
SO! Given the entire sequence, and the state at some point within that sequence, we can calculate the probability of that state at that location in the sequence:
This is, quite literally, the probability that the state at position i in the sequence (given the sequence itself, x) is k. It consists of the forward algorithm (probability of arriving at that state) multiplied by the backward algorithm (probability of the rest of the sequence given the starting point), all divided by the overall probability of the sequence itself.
(for the astute, this is essentially a specific application of Bayes’ Rule)
STILL WITH MEH??? 😀
So! At this point, we have all these great algorithms and methods for finding most likely paths and probable locations in the pathway and methods for galactic domination, but the cold harsh reality is that our information is usually incomplete (the fact that “Hidden” is in the original name would be your first hint). As such, there’s a lot of estimation and inference involved in computing these probabilities. In particular, we really have no way of knowing the probability of changing hidden states, e.g. how likely the casino worker is to switch between the loaded die and the fair die. We also have no way of knowing exactly how likely a particular state is to give rise to a particular output, e.g. how likely a loaded die is to land on one of its faces versus the others. All we have to go on, yet again, is the observed sequence of rolls.
Using this sequence, though, we can come up with some pretty good estimates of those probabilities. First on the list: estimating the transition probabilities.
You should recognize the akl term from the previous algorithms, but the capital A represents the number of times we observe state k switching to state l. The denominator is the sum of all possible transitions from state k to any other state.
Of course, you’re saying, “BUT WE DON’T KNOW HOW MANY TIMES K SWITCHES TO L!” And you are absolutely correct, dear reader. So, we have a way of estimating that as well!
This is insanely complicated, so let me just say that, given a set of training sequences (say, a bunch of die rolls), this computes an estimate for capital A.
And yes, I know you’re screaming yet again that you see calls to the forward and backward algorithms in there, which require the very values we’re trying to estimate here. There’s a little trick statisticians use to get these estimates off the ground called a priori estimates: basically, we pick an arbitrary value as our starting point and go from there. This way, over several iterations, our estimates are narrowed and become more and more accurate over time.
We pick an arbitrary value for transition probabilities (make a guess for the likelihood that the casino worker switches dies). We then use that guess in the above algorithm, using it on the sequences of rolls we observe. This becomes our estimate for capital A, which we then use to compute a new guess for the transition probability between states.
AND THE PROCESS REPEATS!
Of course, we also need the same estimates for the emission probabilities:
And we need a way to compute the number of times we emit a character in a particular state:
And yet again, we start the emission probability with some arbitrary value (take a guess at the probability of a rolling a particular number with the loaded die) and run the algorithm multiple times, narrowing the estimate until it approaches something close to the true probability. For each sequence, this computes the likelihood of a particular output in a particular state by using the forward and backward algorithms, then divides by the probability of the entire sequence, giving you the “expected value” for the character at that location in that state.
This method of estimation, ladies and gentlemen, is duly named the Baum-Welch algorithm. *applause*
So. I’m going to read over this entry a few times until I understand it all. Care to join me? 🙂