At the end of class on Thursday we created an HMM for gene finding that could deal with the issue of codon frame. Some students came up after class and pointed out that the modular construction we had used more states than necessary. Draw an HMM module that does what the module we had on the board does, but has fewer states. Try to get it down to as few states as possible without losing the accuracy of the model.
Here is a summary of what we had, to remind you of where you should be starting from.
First, we had a three-state HMM module for the start codon, which you should call start_codon. We had a similar three-state HMM module for the stop codon, which you should call stop_codon. This stop codon model used emission probability distributions in which the residue emitted depends on previous residues in the observed sequence (see pages 75 and 76 of the text; I called these "higher-order Markov emission probability distributions"). For the purposes of answering this question, you can assume that these emission distributions are given; you don't have to reproduce them. Next, we had a three-state HMM module for an internal codon, which you should call internal_codon. I'd like to think of the internal_codon HMM module as the concatenation of three very simple HMM modules: C0, C1, and C2. Each of these is a one-state HMM that models a single base. C0 models the first base of an internal codon, C1 the second base, and C2 the third base. Each of these HMM modules also uses unspecified higher-order Markov emission probability distributions that you do not need to worry about.
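To make the idea of a higher-order emission distribution concrete, here is a minimal Python sketch of a three-state stop codon module. It assumes the standard stop codons TAA, TAG and TGA. The probabilities and the function names (emit_first, emit_second, emit_third, stop_codon_distribution) are made up for illustration; they are not the distributions we used in class.

```python
# Each state's emission distribution is a function of the bases already
# emitted by the module (the "context"). All numbers are placeholders.

def emit_first(context):
    # Every standard stop codon starts with T.
    return {"T": 1.0}

def emit_second(context):
    # After T, the second base is A (TAA, TAG) or G (TGA).
    return {"A": 0.7, "G": 0.3}

def emit_third(context):
    # The third base depends on the second one: TA can be followed by
    # A or G, but TG can only be followed by A.
    if context[-1] == "A":
        return {"A": 0.6, "G": 0.4}
    return {"A": 1.0}

STOP_STATES = [emit_first, emit_second, emit_third]

def stop_codon_distribution():
    """Return the probability of every 3-base string the module can emit."""
    dist = {"": 1.0}
    for emit in STOP_STATES:
        extended = {}
        for prefix, p in dist.items():
            for base, q in emit(prefix).items():
                extended[prefix + base] = extended.get(prefix + base, 0.0) + p * q
        dist = extended
    return dist

if __name__ == "__main__":
    for codon, p in sorted(stop_codon_distribution().items()):
        print(codon, round(p, 3))
```

With context-independent emissions, a state that can emit A or G in the third position would also give probability to TGG, which is not a stop codon. Conditioning on the previous base is what rules it out.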
We put these three modules together to form an HMM for the coding region of a prokaryotic gene, which we can call coding_prok. The HMM module coding_prok consisted of start_codon followed by internal_codon with a loop transition on it so it could be repeated, followed by stop_codon. It looked like
start_codon -> internal_codon -> stop_codon
               ^            |
               --------------
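As a rough illustration of the diagram above, coding_prok can be written out as a table of state transitions. This is only a sketch: the state names (start_1, stop_1, and so on) and the loop probability P_LOOP are assumptions chosen for the example, not values from class, and emissions are left out entirely.

```python
import random

# Hypothetical probability of going around the internal_codon loop again.
P_LOOP = 0.9

# state -> list of (next_state, transition_probability)
TRANSITIONS = {
    "start_1": [("start_2", 1.0)],
    "start_2": [("start_3", 1.0)],
    "start_3": [("C0", 1.0)],
    "C0": [("C1", 1.0)],
    "C1": [("C2", 1.0)],
    # After the third base of an internal codon, either start another
    # internal codon or move on to the stop codon.
    "C2": [("C0", P_LOOP), ("stop_1", 1.0 - P_LOOP)],
    "stop_1": [("stop_2", 1.0)],
    "stop_2": [("stop_3", 1.0)],
    "stop_3": [],  # no outgoing transitions: end of the module
}

def sample_state_path(rng):
    """Sample one path through the module, from start_1 to stop_3."""
    state = "start_1"
    path = [state]
    while TRANSITIONS[state]:
        targets, weights = zip(*TRANSITIONS[state])
        state = rng.choices(targets, weights=weights)[0]
        path.append(state)
    return path

if __name__ == "__main__":
    path = sample_state_path(random.Random(1))
    print(len(path), "states visited:", " ".join(path))
```

Every state emits exactly one base, so the number of states visited equals the length of the coding region. In this module that length is always a multiple of three, which is how the model keeps track of codon frame.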
To recognize the coding region of a eukaryotic gene (including the introns), we needed a module for an intron. We denoted it by "I", but never specified its detailed state structure. Your first task is to specify the state structure for this HMM module. One simple view of introns is that they are sequences of DNA that begin with the bases GT and end with the bases AG, and can have any sequence of bases in between, with each internal base distributed according to the background probabilities of the intron bases, which you can assume to be uniform. Draw an HMM module for such a definition of an intron. You may assume that the number of bases between the GT and the AG has a geometric distribution, and thus can be modeled using a simple loop transition. Set the probability of this loop transition so that the average number of bases between the GT and the AG for your intron module is 1000. In fact, introns have a more complicated structure, so for extra credit, you may draw an HMM for a more sophisticated intron model. Call your intron model basic_intron (or fancy_intron if you do a fancier intron).
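If you want to check your loop probability, recall how a self-loop relates to expected length. The sketch below assumes one particular topology: a single state with self-loop probability p that is entered exactly once, so that it emits at least one base. In that case the number of bases n has probability (1 - p) p^(n-1) and mean 1/(1 - p). If your module lets the looping state be skipped, the relationship is different. The function names here are hypothetical.

```python
def loop_probability(mean_length):
    """Self-loop probability giving the requested mean number of emissions,
    assuming the looping state is visited at least once."""
    if mean_length < 1:
        raise ValueError("mean_length must be at least 1 under this assumption")
    return 1.0 - 1.0 / mean_length

def mean_by_series(p_loop, terms=100_000):
    """Numerically sum n * P(N = n) as an independent check of 1 / (1 - p)."""
    total = 0.0
    prob = 1.0 - p_loop  # P(N = 1): emit once, then leave
    for n in range(1, terms + 1):
        total += n * prob
        prob *= p_loop  # P(N = n + 1) = P(N = n) * p
    return total

if __name__ == "__main__":
    p = loop_probability(1000)
    print("loop probability:", p)
    print("mean length from series:", mean_by_series(p))
```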
To build an HMM for eukaryotic coding regions, including introns, we created 15 modules. These included three modules for the beginning exon, which we called BE0, BE1 and BE2. These were defined as
BE0:
start_codon -> internal_codon
               ^            |
               --------------

BE1:
start_codon -> internal_codon -> C0
               ^            |
               --------------

BE2:
start_codon -> internal_codon -> C0 -> C1
               ^            |
               --------------

internal_codon -> stop_codon
^            |
--------------

C1 -> C2 -> internal_codon -> stop_codon
            ^            |
            --------------

C2 -> internal_codon -> stop_codon
      ^            |
      --------------

C2 -> internal_codon -> C0
      ^            |
      --------------
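To see why the modular construction is wasteful, it helps to count states. The sketch below uses the module sizes given earlier (three states each for start_codon, internal_codon and stop_codon, and one state each for C0, C1 and C2) and counts the states in the seven module definitions drawn above. Writing each definition as a chain string and the helper name count_states are conventions invented for this example.

```python
# Number of states in each primitive module, as described in the summary.
MODULE_SIZES = {
    "start_codon": 3,
    "internal_codon": 3,
    "stop_codon": 3,
    "C0": 1,
    "C1": 1,
    "C2": 1,
}

# The seven module definitions drawn above, written as chains.
# The loop on internal_codon adds a transition but no states.
DRAWN_MODULES = [
    "start_codon -> internal_codon",
    "start_codon -> internal_codon -> C0",
    "start_codon -> internal_codon -> C0 -> C1",
    "internal_codon -> stop_codon",
    "C1 -> C2 -> internal_codon -> stop_codon",
    "C2 -> internal_codon -> stop_codon",
    "C2 -> internal_codon -> C0",
]

def count_states(chain):
    """Count the states in a chain of primitive modules."""
    names = [name.strip() for name in chain.split("->")]
    return sum(MODULE_SIZES[name] for name in names)

if __name__ == "__main__":
    for chain in DRAWN_MODULES:
        print(count_states(chain), "states:", chain)
    print("total:", sum(count_states(c) for c in DRAWN_MODULES))
```

Each of these definitions carries its own copy of internal_codon, and several also carry their own copy of start_codon or stop_codon. That duplication is where the savings are to be found.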
Draw the HMM module for eukaryotic coding regions that we built from these modules at the end of class. Then use the more primitive modules, like "C0", "internal_codon", etc., as defined above, to define an HMM module coding_euk that has as few states as possible. You can draw this as a high-level drawing, naming modules and referring to them, so long as you have a separate drawing of each module you refer to. Put in the transition probabilities as best you can. For extra credit: add modules for the 5' and 3' UTR regions, or other features you want.